Supervised Machine Learning Techniques Assignment II with the Mortgage Probability of Default Data¶

Yi-En Tseng, Aug 9th 2023

  1. Provide EDA (The distribution of Y by X) for the following variables. Write down the business insights for the variables. EDA is about the relationship between X and Y. So you can say something like "when X is high or low, Y is high or low". Reference the variable dictionary to give the business content. If you do the EDAs in a for loop, it will be very manageable.
  • AP001, AP003, AP008
  • CR009, CR015, CR019
  • PA022, PA023, PA029
  • TD001, TD005, TD006, TD009, TD010, TD014
  1. Feature Engineering: Perform the weight-of-evidence (WOE) transformation for the above variables according to "A Data Scientist’s Toolkit to Encode Categorical Variables to NumericLinks to an external site.".

  2. Build a simple decision tree model or a logistic regression model with the above variables.

  3. Build the RF model and experiment at least two sampling methods (under-sampling or over-sampling techniques).

  4. Build (1) the GBM (Gradient Boosting Machine) model and (2) the Deep Learning model.

  5. Build (1) the GLM model and (2) the autoML model

The criteria include ROC and the cumulative Lift. Make sure you read the H2O documentationLinks to an external site. for the hyper-parameters to test accordingly. You also can select or drop the variables to improve the model performance.

  1. Apply SHAP Values to the data
  • Scikit-learn decision tree methods need appropriate variable transformation including missing imputation or categorical binning, etc. Please see "Avoid These Deadly Modeling Mistakes that May Cost You a Career". If you are more familiar with R dplyr, you can prepare your modeling data in R, save as a .csv file, then just use scikit-learn for the SHAP Values.
  • Please use the top variables learned from your previous models (RF) to run your scikit-learn random forest model.
  • Please keep the top 10 variables.
  • Please provide the following plots:
    • The summary_plot
    • The dependence_plot
    • The force_plot for 4 individual observations
  • Interpretation is essential. You will provide extensive descriptions for your top 5 variables.
  • The variable dictionary of this dataset does not provide much economic meaning. For those variables that you really cannot find more economic meanings, you can still say the relationships between the target and the predictors such as "AP003 shows a positive/megative relationship with the target variable".
  • If your dataset is large, please take 10% or 20% samples.
  • The random forest of scikit-learn needs you to create dummy variables for your categorical variables. For example if you have a variable with 4 categories, you will have 3 dummy variables. Each one has a value of 1 or 0. The 3 variables will enter your scikit-learn random forest model.

Table of Contents¶

  • Section 1 EDA
  • Section 2 Feature Engineering
    • AP001
    • AP003
    • AP008
    • CR009
    • CR015
    • CR019
    • TD001
    • TD002
    • TD006
    • TD009
    • TD010
    • TD014
    • PA022
    • PA023
    • PA029
  • Section 3 Random Forest
  • Section 4 Decision Tree
  • Section 5 GBM
  • Section 6 Deep Learning
  • Section 7 GLM
  • Section 8 AutoML
  • Section 9 SHAP

 

Section 1 EDA ¶

Understand the variables¶

Var	dtypes	    description	     Var Category
AP001	Numeric	YR_AGE	Application
AP003	Numeric	CODE_EDUCATION	Application
AP008	Numeric	FLAG_IP_CITY_NOT_APPL_CITY	Application
CR009	Numeric	AMT_LOAN_TOTAL	Credit Bureau
CR015	Numeric	MONTH_CREDIT_CARD_MOB_MAX	Credit Bureau
CR019	Numeric	SCORE_SINGLE_DEBIT_CARD_LIMIT	Credit Bureau
PA022	Numeric	DAYS_BTW_APPLICATION_AND_FIRST_COLLECTION_OR_HIGH_RISK_CALL	Call Detail
PA023	Numeric	DAYS_BTW_APPLICATION_AND_FIRST_COLLECTION_CALL	Call Detail
PA029	Numeric	AVG_LEN_COLLECTION_OR_HIGH_RISK_INBOUND_CALLS	Call Detail
TD001	Numeric	TD_CNT_QUERY_LAST_7Day_P2P	Credit Center
TD002	Numeric	TD_CNT_QUERY_LAST_7Day_SMALL_LOAN	Credit Center
TD006	Numeric	TD_CNT_QUERY_LAST_1MON_SMALL_LOAN	Credit Center
TD009	Numeric	TD_CNT_QUERY_LAST_3MON_P2P	Credit Center
TD010	Numeric	TD_CNT_QUERY_LAST_3MON_SMALL_LOAN	Credit Center
TD014	Numeric	TD_CNT_QUERY_LAST_6MON_SMALL_LOAN	Credit Center
In [1]:
import pandas as pd
#path = '/Users/yientseng/Desktop/Classes/APAN 5420/L3/'
#df = pd.read_csv(path + 'XYZloan_default_selected_vars.csv')
df = pd.read_csv('XYZloan_default_selected_vars.csv')
df.head(5)
Out[1]:
Unnamed: 0.1 Unnamed: 0 id loan_default AP001 AP002 AP003 AP004 AP005 AP006 ... CD162 CD164 CD166 CD167 CD169 CD170 CD172 CD173 MB005 MB007
0 0 1 1 1 31 2 1 12 2017/7/6 10:21 ios ... 13.0 13.0 0.0 0.0 1449.0 1449.0 2249.0 2249.0 7.0 IPHONE7
1 1 2 2 0 27 1 1 12 2017/4/6 12:51 h5 ... -99.0 -99.0 -99.0 -99.0 -99.0 -99.0 -99.0 -99.0 NaN WEB
2 2 3 3 0 33 1 4 12 2017/7/1 14:11 h5 ... 3.0 2.0 33.0 0.0 33.0 0.0 143.0 110.0 8.0 WEB
3 3 4 4 0 34 2 4 12 2017/7/7 10:10 android ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10.0 OPPO
4 4 5 5 0 47 2 1 12 2017/7/6 14:37 h5 ... -99.0 -99.0 -99.0 -99.0 -99.0 -99.0 -99.0 -99.0 NaN WEB

5 rows × 89 columns

In [2]:
columns_to_keep = ['id','loan_default','AP001', 'AP003', 'AP008', 'CR009', 'CR015', 'CR019', 'PA022', 'PA023', 'PA029', 'TD001', 'TD005', 'TD006','TD009', 'TD010', 'TD014']
df = df[columns_to_keep]
df.shape
df.describe()
Out[2]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 PA029 TD001 TD005 TD006 TD009 TD010 TD014
count 80000.000000 80000.000000 80000.000000 80000.000000 80000.000000 8.000000e+04 80000.000000 80000.000000 79619.000000 79619.000000 79619.000000 80000.000000 80000.000000 80000.000000 80000.00000 80000.000000 80000.000000
mean 40000.500000 0.193600 31.706913 2.014925 3.117200 3.518711e+04 4.924750 6.199038 19.298811 14.828822 -42.407356 1.986962 3.593037 1.345700 5.40600 2.020812 2.603662
std 23094.155105 0.395121 7.075070 1.196806 1.306335 6.359684e+04 1.094305 3.359354 39.705478 37.009374 97.006168 1.807445 2.799570 1.413362 4.02311 1.973988 2.505840
min 1.000000 0.000000 20.000000 1.000000 1.000000 0.000000e+00 2.000000 1.000000 -99.000000 -99.000000 -99.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 20000.750000 0.000000 27.000000 1.000000 2.000000 4.700000e+03 5.000000 3.000000 -1.000000 -1.000000 -98.000000 1.000000 2.000000 0.000000 3.00000 1.000000 1.000000
50% 40000.500000 0.000000 30.000000 1.000000 3.000000 1.728500e+04 5.000000 5.000000 -1.000000 -1.000000 -98.000000 2.000000 3.000000 1.000000 4.00000 2.000000 2.000000
75% 60000.250000 0.000000 35.000000 3.000000 4.000000 4.075000e+04 6.000000 10.000000 41.000000 14.000000 26.000000 3.000000 5.000000 2.000000 7.00000 3.000000 4.000000
max 80000.000000 1.000000 56.000000 6.000000 5.000000 1.420300e+06 6.000000 12.000000 448.000000 448.000000 2872.000000 20.000000 24.000000 21.000000 46.00000 35.000000 43.000000
In [3]:
AP001_type = df.dtypes['AP001']
AP003_type = df.dtypes['AP003']
AP008_type = df.dtypes['AP008']
CR009_type = df.dtypes['CR009']
CR015_type = df.dtypes['CR009']
CR019_type = df.dtypes['CR009']
PA022_type = df.dtypes['PA022']
PA023_type = df.dtypes['PA023']
PA029_type = df.dtypes['PA029']
TD001_type = df.dtypes['TD001']
TD005_type = df.dtypes['TD005']
TD006_type = df.dtypes['TD006']
TD009_type = df.dtypes['TD009']
TD010_type = df.dtypes['TD010']
TD014_type = df.dtypes['TD014']
print(AP001_type, AP003_type, AP008_type,CR009_type, CR015_type, CR019_type,PA022_type, PA023_type, PA029_type)
print(TD001_type, TD005_type, TD006_type, TD009_type, TD010_type, TD014_type)
int64 int64 int64 int64 int64 int64 float64 float64 float64
int64 int64 int64 int64 int64 int64
In [4]:
#Examine missing data, only in 3 variables 'PA022', 'PA023', 'PA029'
#is_missing_PA022 = df['PA022'].isna().any()
#is_missing_PA022 TRUE
#is_missing_PA023 = df['PA023'].isna().any()
#is_missing_PA023 TRUE
#is_missing_PA029 = df['PA029'].isna().any()
#is_missing_PA029 TRUE
In [5]:
variables = ['AP001', 'AP003', 'AP008', 'CR009', 'CR015', 'CR019', 'PA022', 'PA023', 'PA029', 'TD001', 'TD005', 'TD006','TD009', 'TD010', 'TD014']
target_variable = 'loan_default'

for var in variables:
    # Calculating the average loan_default for different values of X
    avg_loan_default_by_X = df.groupby(var)[target_variable].mean()
    print(f'Average {target_variable} by {var}:')
    print(avg_loan_default_by_X)
    print('\n')
Average loan_default by AP001:
AP001
20    0.221239
21    0.264848
22    0.208487
23    0.204638
24    0.200047
25    0.204809
26    0.201072
27    0.211450
28    0.198194
29    0.190930
30    0.197074
31    0.194540
32    0.189284
33    0.188460
34    0.187583
35    0.178831
36    0.178732
37    0.173545
38    0.182713
39    0.181269
40    0.194564
41    0.180445
42    0.183333
43    0.179156
44    0.181818
45    0.165138
46    0.178610
47    0.192771
48    0.169839
49    0.166124
50    0.185819
51    0.173913
52    0.193309
53    0.135036
54    0.155797
55    0.206061
56    0.160000
Name: loan_default, dtype: float64


Average loan_default by AP003:
AP003
1    0.221034
3    0.173948
4    0.125853
5    0.060345
6    0.000000
Name: loan_default, dtype: float64


Average loan_default by AP008:
AP008
1    0.168286
2    0.179188
3    0.195604
4    0.209325
5    0.209394
Name: loan_default, dtype: float64


Average loan_default by CR009:
CR009
0          0.171687
50         0.000000
99         0.000000
100        0.000000
150        0.000000
             ...   
1353000    1.000000
1368505    1.000000
1381000    0.000000
1381800    0.000000
1420300    0.000000
Name: loan_default, Length: 25883, dtype: float64


Average loan_default by CR015:
CR015
2    0.188389
3    0.247678
4    0.218583
5    0.207024
6    0.154864
Name: loan_default, dtype: float64


Average loan_default by CR019:
CR019
1     0.221311
2     0.220759
3     0.213964
4     0.212296
5     0.196685
6     0.179220
7     0.195236
8     0.182716
9     0.163152
10    0.177083
11    0.165039
12    0.163088
Name: loan_default, dtype: float64


Average loan_default by PA022:
PA022
-99.0     0.149935
-1.0      0.171054
 0.0      0.193103
 1.0      0.296117
 2.0      0.227907
            ...   
 437.0    0.000000
 440.0    0.000000
 441.0    0.000000
 445.0    1.000000
 448.0    0.000000
Name: loan_default, Length: 172, dtype: float64


Average loan_default by PA023:
PA023
-99.0     0.149935
-1.0      0.175095
 0.0      0.162393
 1.0      0.273256
 2.0      0.257310
            ...   
 434.0    1.000000
 440.0    0.000000
 441.0    0.000000
 445.0    1.000000
 448.0    0.000000
Name: loan_default, Length: 167, dtype: float64


Average loan_default by PA029:
PA029
-99.00      0.149935
-98.00      0.173775
 0.00       0.288136
 1.00       0.136364
 1.50       0.000000
              ...   
 1757.00    0.000000
 1767.75    0.000000
 1919.00    0.000000
 2014.00    1.000000
 2872.00    0.000000
Name: loan_default, Length: 4120, dtype: float64


Average loan_default by TD001:
TD001
0     0.156904
1     0.163815
2     0.197216
3     0.213688
4     0.236021
5     0.259870
6     0.277253
7     0.278652
8     0.328228
9     0.302419
10    0.259259
11    0.369048
12    0.288889
13    0.400000
14    0.466667
15    0.555556
16    0.166667
17    0.000000
18    0.500000
19    0.750000
20    1.000000
Name: loan_default, dtype: float64


Average loan_default by TD005:
TD005
0     0.132324
1     0.126238
2     0.163685
3     0.188810
4     0.201861
5     0.227266
6     0.244974
7     0.268191
8     0.265170
9     0.299129
10    0.316881
11    0.290634
12    0.332613
13    0.322884
14    0.380734
15    0.387324
16    0.371134
17    0.409091
18    0.257143
19    0.423077
20    0.277778
21    0.333333
22    0.400000
23    0.375000
24    0.400000
Name: loan_default, dtype: float64


Average loan_default by TD006:
TD006
0     0.168552
1     0.176399
2     0.207509
3     0.242584
4     0.269746
5     0.295133
6     0.325503
7     0.307167
8     0.335526
9     0.462264
10    0.327586
11    0.394737
12    0.388889
13    0.307692
14    0.285714
15    0.333333
16    0.500000
17    0.600000
18    0.333333
20    0.000000
21    1.000000
Name: loan_default, dtype: float64


Average loan_default by TD009:
TD009
0     0.113156
1     0.115699
2     0.139940
3     0.158747
4     0.177003
5     0.195302
6     0.209825
7     0.222468
8     0.239288
9     0.254988
10    0.270819
11    0.274088
12    0.291883
13    0.268065
14    0.310415
15    0.335725
16    0.276986
17    0.332432
18    0.312253
19    0.324742
20    0.391892
21    0.407407
22    0.473684
23    0.264706
24    0.486486
25    0.391304
26    0.466667
27    0.125000
28    0.526316
29    0.375000
30    0.444444
31    0.200000
32    0.666667
33    0.333333
34    0.750000
36    0.000000
38    0.000000
39    0.000000
46    1.000000
Name: loan_default, dtype: float64


Average loan_default by TD010:
TD010
0     0.152064
1     0.163854
2     0.191281
3     0.223402
4     0.248087
5     0.275062
6     0.280515
7     0.298647
8     0.296015
9     0.353741
10    0.350000
11    0.370370
12    0.310811
13    0.347826
14    0.433333
15    0.481481
16    0.555556
17    0.647059
18    0.384615
19    0.000000
20    0.500000
21    0.666667
22    0.428571
23    0.000000
24    0.800000
25    0.500000
26    0.000000
28    0.500000
30    1.000000
35    1.000000
Name: loan_default, dtype: float64


Average loan_default by TD014:
TD014
0     0.142579
1     0.155745
2     0.179754
3     0.205032
4     0.237809
5     0.252252
6     0.266280
7     0.294633
8     0.291915
9     0.306410
10    0.311155
11    0.321101
12    0.331839
13    0.297297
14    0.340909
15    0.428571
16    0.446154
17    0.416667
18    0.233333
19    0.363636
20    0.350000
21    0.312500
22    0.636364
23    0.600000
24    0.571429
25    0.666667
26    0.333333
27    1.000000
28    0.500000
30    0.000000
31    0.000000
32    0.000000
36    1.000000
43    1.000000
Name: loan_default, dtype: float64


In [6]:
# Plot for each column
import matplotlib.pyplot as plt
def plot_histogram(data_frame, column_name):
    %matplotlib inline
    # Check if the column_name exists in the DataFrame
    if column_name not in data_frame.columns:
        raise ValueError(f"Column '{column_name}' does not exist in the DataFrame.")
    # Plot the histogram
    data_frame[column_name].hist()
In [7]:
plot_histogram(df, 'AP003')
In [8]:
# Plot for all columns
def plot_histograms_for_all_columns(data_frame):
    %matplotlib inline
    for column_name in data_frame.columns:
        data_frame[column_name].hist()
        plt.title(f'Histogram of {column_name}')
        plt.xlabel(column_name)
        plt.ylabel('Frequency')
        plt.show()

plot_histograms_for_all_columns(df)

Section 2 Feature Engineering ¶

Define Functions & Split Data¶

In [9]:
import numpy as np
from sklearn.model_selection import train_test_split

# Variables for WOE transformation
variables = ['AP001', 'AP003', 'AP008', 'CR009', 'CR015', 'CR019', 'PA022', 'PA023', 'PA029', 'TD001', 'TD005', 'TD006', 'TD009', 'TD010', 'TD014']
target_variable = 'loan_default'

# Splitting data into train and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
In [10]:
#Define function of WOE for train data
def WOE(var):
    train_df[var] = train_df[var].fillna('NoData')
    k = train_df[[var,'loan_default']].groupby(var)['loan_default'].agg(['count','sum']).reset_index()
    k.columns = [var,'Count','Good']
    k['Bad'] = k['Count'] - k['Good']
    k['Good %'] = (k['Good'] / k['Good'].sum()*100).round(2)
    k['Bad %'] = (k['Bad'] / k['Bad'].sum()*100).round(2)
    k[var+'_WOE'] = np.log(k['Good %'] / k['Bad %']).round(2)
    k = k.sort_values(by=var+'_WOE')
    return(k)

AP001 (YR_AGE): ¶

  • A numeric variable representing the applicant's age.
  • By analyzing the histogram and average loan_default values, we can understand if there is any correlation between age and the likelihood of defaulting on a loan. For instance, if younger individuals have a higher average loan_default, it suggests that age may be a significant factor in loan default risk.
In [11]:
k = WOE('AP001')
k
Out[11]:
AP001 Count Good Bad Good % Bad % AP001_WOE
33 53 224 31 193 0.25 0.37 -0.39
31 51 294 47 247 0.38 0.48 -0.23
34 54 220 37 183 0.30 0.35 -0.15
28 48 530 90 440 0.73 0.85 -0.15
25 45 788 135 653 1.09 1.26 -0.14
17 37 1530 264 1266 2.14 2.45 -0.14
16 36 1941 338 1603 2.74 3.10 -0.12
29 49 494 87 407 0.71 0.79 -0.11
24 44 790 140 650 1.13 1.26 -0.11
19 39 1325 234 1091 1.90 2.11 -0.10
26 46 749 133 616 1.08 1.19 -0.10
15 35 2321 417 1904 3.38 3.69 -0.09
21 41 1022 184 838 1.49 1.62 -0.08
30 50 321 58 263 0.47 0.51 -0.08
23 43 916 167 749 1.35 1.45 -0.07
18 38 1457 270 1187 2.19 2.30 -0.05
13 33 2764 516 2248 4.18 4.35 -0.04
27 47 658 123 535 1.00 1.04 -0.04
14 34 2393 449 1944 3.64 3.76 -0.03
9 29 4056 761 3295 6.17 6.38 -0.03
22 42 944 179 765 1.45 1.48 -0.02
12 32 2947 561 2386 4.55 4.62 -0.02
36 56 19 4 15 0.03 0.03 0.00
11 31 3718 724 2994 5.87 5.80 0.01
20 40 1110 217 893 1.76 1.73 0.02
10 30 4358 870 3488 7.05 6.75 0.04
8 28 4704 936 3768 7.59 7.29 0.04
6 26 4378 875 3503 7.09 6.78 0.04
5 25 3938 784 3154 6.35 6.11 0.04
4 24 3426 681 2745 5.52 5.31 0.04
3 23 2330 465 1865 3.77 3.61 0.04
35 55 131 26 105 0.21 0.20 0.05
32 52 219 44 175 0.36 0.34 0.06
7 27 5081 1060 4021 8.59 7.78 0.10
2 22 1310 278 1032 2.25 2.00 0.12
0 20 86 18 68 0.15 0.13 0.14
1 21 508 135 373 1.09 0.72 0.41
In [12]:
#Append the WOE value of feature back to the original train data
#train_df_AP001_WOE = pd.merge(train_df[['loan_default','AP001']],k[['AP001','AP001_WOE']],
#     left_on='AP001',
#     right_on='AP001',how='left')
#train_df_AP001_WOE.head(10)

train_df_WOE_AP001 = pd.merge(train_df, k[['AP001', 'AP001_WOE']],
                             left_on='AP001',
                             right_on='AP001', how='left')
train_df_WOE_AP001.head(10)
Out[12]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 PA029 TD001 TD005 TD006 TD009 TD010 TD014 AP001_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 -98.0 5 8 3 14 5 5 -0.03
1 35563 1 47 1 2 0 6 12 87.0 87.0 17.5 2 2 0 2 1 1 -0.04
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 -98.0 2 3 1 6 2 2 0.01
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 -98.0 5 9 3 9 3 3 -0.03
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 -98.0 2 2 0 2 0 0 -0.09
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 -98.0 5 11 3 11 4 4 0.04
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 -98.0 3 4 1 6 3 3 -0.09
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 -98.0 4 4 1 6 1 1 0.04
8 56100 1 26 3 5 20799 5 5 12.0 12.0 96.0 4 9 1 10 1 2 0.04
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 52.0 2 3 3 5 3 3 -0.14
In [13]:
#Append the WOE table to the test data
test_df_WOE_AP001 = pd.merge(test_df, k[['AP001', 'AP001_WOE']],
                             left_on='AP001',
                             right_on='AP001', how='left')
test_df_WOE_AP001.head(10)
Out[13]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 PA029 TD001 TD005 TD006 TD009 TD010 TD014 AP001_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 -98.000000 2 2 1 2 1 1 0.04
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 -98.000000 2 4 1 7 1 2 -0.04
2 74784 0 29 4 5 33000 5 11 51.0 51.0 7.000000 1 3 1 4 1 1 -0.03
3 70976 1 28 1 5 3000 5 3 85.0 85.0 120.285714 1 1 3 1 3 4 0.04
4 46646 0 27 1 3 48219 5 11 58.0 58.0 180.000000 4 7 2 15 5 6 0.10
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 -98.000000 5 7 0 8 3 4 -0.04
6 65510 0 23 3 1 8100 2 3 75.0 75.0 139.000000 9 14 6 25 8 11 0.04
7 62716 0 36 1 3 0 5 3 115.0 115.0 17.000000 2 3 0 3 1 2 -0.12
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 -98.000000 2 5 1 8 2 6 0.41
9 58835 0 24 3 2 60877 5 10 52.0 23.0 164.000000 3 3 1 6 2 3 0.04
In [14]:
#Append the WOE table to the test data
#test_df_AP001_WOE = pd.merge(test_df[['loan_default','AP001']],k[['AP001','AP001_WOE']],
#     left_on='AP001',
#     right_on='AP001',how='left')
#test_df_AP001_WOE.head(10)
In [15]:
nan_check = test_df_WOE_AP001['AP001_WOE'].isna()
nan_values = test_df_WOE_AP001['AP001_WOE'][nan_check]
nan_values
Out[15]:
Series([], Name: AP001_WOE, dtype: float64)
In [16]:
nan_check = train_df_WOE_AP001['AP001_WOE'].isna()
nan_values = train_df_WOE_AP001['AP001_WOE'][nan_check]
nan_values
Out[16]:
Series([], Name: AP001_WOE, dtype: float64)

AP003 (CODE_EDUCATION): ¶

  • A numeric variable representing the applicant's education level.
  • This variable's distribution and average loan_default values can provide insights into the relationship between education level and loan default. If higher education levels correspond to lower average loan_default, it indicates that education may serve as a protective factor against default.
In [17]:
k = WOE('AP003')
k
#Need to bin this variables
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
Out[17]:
AP003 Count Good Bad Good % Bad % AP003_WOE
4 6 12 0 12 0.00 0.02 -inf
3 5 187 11 176 0.09 0.34 -1.33
2 4 8672 1107 7565 8.97 14.64 -0.49
1 3 19072 3301 15771 26.75 30.53 -0.13
0 1 36057 7919 28138 64.18 54.47 0.16
In [18]:
#train_df['AP003_bin'] = pd.qcut(train_df['AP003'],5,duplicates='drop').values.add_categories("NoData")
#train_df['AP003_bin'] = train_df['AP003_bin'].fillna("NoData").astype(str)
#train_df['AP003_bin'].value_counts(dropna=False)

#pd.cut: Given the values 0, 1, 3, 4, and 5, here's how they would be categorized based on the default behavior:
#0 will belong to the bin interval [0, 1.2)
#1 will belong to the bin interval [0, 1.2)
#3 will belong to the bin interval [2.4, 3.6)
#4 will belong to the bin interval [3.6, 4.8)
#5 will belong to the bin interval [4.8, 6)

#train_df['AP003_bin'] = pd.cut(train_df['AP003'], bins=5, duplicates='drop', labels=['Category 1', 'Category 2', 'Category 3', 'Category 4', 'Category 5'])
#train_df['AP003_bin'] = train_df['AP003_bin'].astype(str).fillna("NoData")
#train_df['AP003_bin'].value_counts(dropna=False)

#Still has -inf value
In [19]:
#Bin the train data
train_df['AP003_bin'] = pd.qcut(train_df['AP003'],5,duplicates='drop').values.add_categories("NoData")
train_df['AP003_bin'] = train_df['AP003_bin'].fillna("NoData").astype(str)
train_df['AP003_bin'].value_counts(dropna=False)
Out[19]:
(0.999, 3.0]    55129
(3.0, 6.0]       8871
Name: AP003_bin, dtype: int64
In [20]:
k = WOE('AP003_bin')
k
Out[20]:
AP003_bin Count Good Bad Good % Bad % AP003_bin_WOE
1 (3.0, 6.0] 8871 1118 7753 9.06 15.01 -0.50
0 (0.999, 3.0] 55129 11220 43909 90.94 84.99 0.07
In [21]:
train_df
Out[21]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 PA029 TD001 TD005 TD006 TD009 TD010 TD014 AP003_bin
3822 3823 0 29 4 2 37635 5 5 -1.0 -1.0 -98.0 5 8 3 14 5 5 (3.0, 6.0]
35562 35563 1 47 1 2 0 6 12 87.0 87.0 17.5 2 2 0 2 1 1 (0.999, 3.0]
4883 4884 0 31 1 5 47506 5 12 -1.0 -1.0 -98.0 2 3 1 6 2 2 (0.999, 3.0]
71170 71171 0 29 3 4 22037 6 5 -1.0 -1.0 -98.0 5 9 3 9 3 3 (0.999, 3.0]
25665 25666 0 35 4 3 67400 6 7 -1.0 -1.0 -98.0 2 2 0 2 0 0 (3.0, 6.0]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6265 6266 0 25 3 3 12000 5 3 -1.0 -1.0 -98.0 4 4 1 5 1 2 (0.999, 3.0]
54886 54887 0 31 3 4 60300 6 5 69.0 -1.0 39.0 2 4 1 5 1 1 (0.999, 3.0]
76820 76821 0 28 3 2 45167 5 3 -1.0 -1.0 -98.0 2 13 3 14 3 3 (0.999, 3.0]
860 861 1 28 1 5 59111 6 11 -1.0 -1.0 -98.0 1 2 2 8 2 2 (0.999, 3.0]
15795 15796 0 27 1 4 2878 5 2 -1.0 -1.0 -98.0 1 1 1 3 1 1 (0.999, 3.0]

64000 rows × 18 columns

In [22]:
train_df_WOE_AP003 = pd.merge(train_df, k[['AP003_bin', 'AP003_bin_WOE']],
                             left_on='AP003_bin',
                             right_on='AP003_bin', how='left')
train_df_WOE_AP003.head(10)
Out[22]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 PA029 TD001 TD005 TD006 TD009 TD010 TD014 AP003_bin AP003_bin_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 -98.0 5 8 3 14 5 5 (3.0, 6.0] -0.50
1 35563 1 47 1 2 0 6 12 87.0 87.0 17.5 2 2 0 2 1 1 (0.999, 3.0] 0.07
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 -98.0 2 3 1 6 2 2 (0.999, 3.0] 0.07
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 -98.0 5 9 3 9 3 3 (0.999, 3.0] 0.07
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 -98.0 2 2 0 2 0 0 (3.0, 6.0] -0.50
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 -98.0 5 11 3 11 4 4 (0.999, 3.0] 0.07
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 -98.0 3 4 1 6 3 3 (0.999, 3.0] 0.07
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 -98.0 4 4 1 6 1 1 (0.999, 3.0] 0.07
8 56100 1 26 3 5 20799 5 5 12.0 12.0 96.0 4 9 1 10 1 2 (0.999, 3.0] 0.07
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 52.0 2 3 3 5 3 3 (0.999, 3.0] 0.07
In [23]:
#train_df_WOE_AP003_usedtomerge = train_df_WOE_AP003.drop(columns=train_df_WOE_AP003.columns.difference(['AP003', 'AP003_bin']))
#train_df_WOE_AP003_usedtomerge
In [24]:
#Merge the WOE value of each category with the train data
#train_df_AP003_WOE = pd.merge(train_df[['loan_default','AP003''AP003_bin']],k[['AP003_bin','AP003_bin_WOE']],
#     left_on='AP003_bin',
#     right_on='AP003_bin',how='left')
#train_df_AP003_WOE.head(10)

#train_df_WOE = pd.merge(train_df_WOE, train_df_usedtomerge[['AP003', 'AP003_bin']],
#                             left_on='AP003',
#                             right_on='AP003', how='left')
#train_df_WOE.head(10)
In [25]:
nan_check = train_df_WOE_AP003['AP003_bin_WOE'].isna()
nan_values = train_df_WOE_AP003['AP003_bin_WOE'][nan_check]
nan_values
Out[25]:
Series([], Name: AP003_bin_WOE, dtype: float64)
In [26]:
#Append the WOE value of each category back to the original train data
#train_df['AP003_WOE']=train_df_WOE_AP003['AP003_bin_WOE']
In [27]:
# Define the desired bin labels
bin_labels = ["(0.999, 3.0]", "(3.0, 6.0]"]
# Bin the test data with the specified labels
test_df['AP003_bin_labels'] = pd.qcut(test_df['AP003'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['AP003_bin'] = pd.qcut(test_df['AP003'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['AP003_bin'] = test_df['AP003_bin'].fillna("NoData")
# Print the value counts
test_df['AP003_bin'].value_counts(dropna=False)
Out[27]:
(0.999, 3.0]    13779
(3.0, 6.0]       2221
Name: AP003_bin, dtype: int64
In [28]:
test_df
Out[28]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 PA029 TD001 TD005 TD006 TD009 TD010 TD014 AP003_bin_labels AP003_bin
47044 47045 0 30 3 3 10000 5 5 25.0 25.0 -98.000000 2 2 1 2 1 1 0 (0.999, 3.0]
44295 44296 0 33 3 5 27288 5 5 -1.0 -1.0 -98.000000 2 4 1 7 1 2 0 (0.999, 3.0]
74783 74784 0 29 4 5 33000 5 11 51.0 51.0 7.000000 1 3 1 4 1 1 1 (3.0, 6.0]
70975 70976 1 28 1 5 3000 5 3 85.0 85.0 120.285714 1 1 3 1 3 4 0 (0.999, 3.0]
46645 46646 0 27 1 3 48219 5 11 58.0 58.0 180.000000 4 7 2 15 5 6 0 (0.999, 3.0]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
67666 67667 0 41 1 5 46967 6 11 56.0 56.0 0.000000 2 3 2 4 4 4 0 (0.999, 3.0]
51146 51147 0 39 1 2 25796 6 2 91.0 91.0 59.500000 4 11 3 14 4 5 0 (0.999, 3.0]
42494 42495 1 31 1 2 0 5 3 -1.0 -1.0 -98.000000 3 3 1 3 1 2 0 (0.999, 3.0]
52517 52518 0 34 1 1 3600 3 2 -1.0 -1.0 -98.000000 3 3 1 3 1 2 0 (0.999, 3.0]
7754 7755 0 43 3 2 52000 6 10 -1.0 -1.0 -98.000000 2 5 1 10 3 5 0 (0.999, 3.0]

16000 rows × 19 columns

In [29]:
#Append the WOE table to the test data
#test_df_WOE_AP003 = pd.merge(test_df[['id','loan_default','AP003','AP003_bin']],k[['AP003_bin','AP003_bin_WOE']],
#     left_on='AP003_bin',
#     right_on='AP003_bin',how='left')
#test_df_AP003_WOE.head(10)
#TD010 way
test_df_WOE_AP003 = pd.merge(test_df, k[['AP003_bin', 'AP003_bin_WOE']],
                             left_on='AP003_bin',
                             right_on='AP003_bin', how='left')
test_df_WOE_AP003.head(10)
Out[29]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 PA029 TD001 TD005 TD006 TD009 TD010 TD014 AP003_bin_labels AP003_bin AP003_bin_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 -98.000000 2 2 1 2 1 1 0 (0.999, 3.0] 0.07
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 -98.000000 2 4 1 7 1 2 0 (0.999, 3.0] 0.07
2 74784 0 29 4 5 33000 5 11 51.0 51.0 7.000000 1 3 1 4 1 1 1 (3.0, 6.0] -0.50
3 70976 1 28 1 5 3000 5 3 85.0 85.0 120.285714 1 1 3 1 3 4 0 (0.999, 3.0] 0.07
4 46646 0 27 1 3 48219 5 11 58.0 58.0 180.000000 4 7 2 15 5 6 0 (0.999, 3.0] 0.07
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 -98.000000 5 7 0 8 3 4 1 (3.0, 6.0] -0.50
6 65510 0 23 3 1 8100 2 3 75.0 75.0 139.000000 9 14 6 25 8 11 0 (0.999, 3.0] 0.07
7 62716 0 36 1 3 0 5 3 115.0 115.0 17.000000 2 3 0 3 1 2 0 (0.999, 3.0] 0.07
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 -98.000000 2 5 1 8 2 6 0 (0.999, 3.0] 0.07
9 58835 0 24 3 2 60877 5 10 52.0 23.0 164.000000 3 3 1 6 2 3 0 (0.999, 3.0] 0.07
In [30]:
nan_check = test_df_WOE_AP003['AP003_bin_WOE'].isna()
nan_values = test_df_WOE_AP003['AP003_bin_WOE'][nan_check]
nan_values
Out[30]:
Series([], Name: AP003_bin_WOE, dtype: float64)

AP008 (FLAG_IP_CITY_NOT_APPL_CITY): ¶

  • Numeric variable indicating whether the applicant's current city is different from the city of application.
  • This flag variable can reveal the impact of residing in the same city as the applied city on loan default, suggesting that geographical factors play a role in loan repayment behavior.
In [31]:
k = WOE('AP008')
k
Out[31]:
AP008 Count Good Bad Good % Bad % AP008_WOE
0 1 6788 1107 5681 8.97 11.00 -0.20
1 2 17470 3119 14351 25.28 27.78 -0.09
2 3 14818 2902 11916 23.52 23.07 0.02
3 4 11381 2356 9025 19.10 17.47 0.09
4 5 13543 2854 10689 23.13 20.69 0.11
In [32]:
#Append the WOE value of feature back to the original train data
train_df_WOE_AP008 = pd.merge(train_df, k[['AP008', 'AP008_WOE']],
                             left_on='AP008',
                             right_on='AP008', how='left')
train_df_WOE_AP008.head(10)
Out[32]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 PA029 TD001 TD005 TD006 TD009 TD010 TD014 AP003_bin AP008_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 -98.0 5 8 3 14 5 5 (3.0, 6.0] -0.09
1 35563 1 47 1 2 0 6 12 87.0 87.0 17.5 2 2 0 2 1 1 (0.999, 3.0] -0.09
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 -98.0 2 3 1 6 2 2 (0.999, 3.0] 0.11
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 -98.0 5 9 3 9 3 3 (0.999, 3.0] 0.09
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 -98.0 2 2 0 2 0 0 (3.0, 6.0] 0.02
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 -98.0 5 11 3 11 4 4 (0.999, 3.0] -0.09
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 -98.0 3 4 1 6 3 3 (0.999, 3.0] 0.11
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 -98.0 4 4 1 6 1 1 (0.999, 3.0] 0.11
8 56100 1 26 3 5 20799 5 5 12.0 12.0 96.0 4 9 1 10 1 2 (0.999, 3.0] 0.11
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 52.0 2 3 3 5 3 3 (0.999, 3.0] 0.02
In [33]:
#Append the WOE table to the test data
test_df_WOE_AP008 = pd.merge(test_df, k[['AP008', 'AP008_WOE']],
                             left_on='AP008',
                             right_on='AP008', how='left')
test_df_WOE_AP008.head(10)
Out[33]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 PA029 TD001 TD005 TD006 TD009 TD010 TD014 AP003_bin_labels AP003_bin AP008_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 -98.000000 2 2 1 2 1 1 0 (0.999, 3.0] 0.02
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 -98.000000 2 4 1 7 1 2 0 (0.999, 3.0] 0.11
2 74784 0 29 4 5 33000 5 11 51.0 51.0 7.000000 1 3 1 4 1 1 1 (3.0, 6.0] 0.11
3 70976 1 28 1 5 3000 5 3 85.0 85.0 120.285714 1 1 3 1 3 4 0 (0.999, 3.0] 0.11
4 46646 0 27 1 3 48219 5 11 58.0 58.0 180.000000 4 7 2 15 5 6 0 (0.999, 3.0] 0.02
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 -98.000000 5 7 0 8 3 4 1 (3.0, 6.0] -0.20
6 65510 0 23 3 1 8100 2 3 75.0 75.0 139.000000 9 14 6 25 8 11 0 (0.999, 3.0] -0.20
7 62716 0 36 1 3 0 5 3 115.0 115.0 17.000000 2 3 0 3 1 2 0 (0.999, 3.0] 0.02
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 -98.000000 2 5 1 8 2 6 0 (0.999, 3.0] 0.02
9 58835 0 24 3 2 60877 5 10 52.0 23.0 164.000000 3 3 1 6 2 3 0 (0.999, 3.0] -0.09
In [34]:
nan_check = test_df_WOE_AP008['AP008_WOE'].isna()
nan_values = test_df_WOE_AP008['AP008_WOE'][nan_check]
nan_values
Out[34]:
Series([], Name: AP008_WOE, dtype: float64)
In [35]:
nan_check = train_df_WOE_AP008['AP008_WOE'].isna()
nan_values = train_df_WOE_AP008['AP008_WOE'][nan_check]
nan_values
Out[35]:
Series([], Name: AP008_WOE, dtype: float64)

CR009 (AMT_LOAN_TOTAL): ¶

  • A numeric variable representing the total loan amount reported by the credit bureau.
  • This variable represents the total loan amount reported by the credit bureau. Higher values of AMT_LOAN_TOTAL indicate larger loan amounts, suggesting that individuals with higher loan amounts may have a greater financial commitment or higher borrowing capacity.
In [36]:
k = WOE('CR009')
k
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
Out[36]:
CR009 Count Good Bad Good % Bad % CR009_WOE
3296 12050 4 0 4 0.0 0.01 -inf
3741 13257 4 0 4 0.0 0.01 -inf
17901 88500 5 0 5 0.0 0.01 -inf
3695 13125 3 0 3 0.0 0.01 -inf
8695 27288 3 0 3 0.0 0.01 -inf
... ... ... ... ... ... ... ...
21847 1214822 1 0 1 0.0 0.00 NaN
21848 1238000 1 0 1 0.0 0.00 NaN
21849 1243934 1 0 1 0.0 0.00 NaN
21851 1381000 1 0 1 0.0 0.00 NaN
21852 1420300 1 0 1 0.0 0.00 NaN

21853 rows × 7 columns

In [37]:
#Bin the train data
train_df['CR009_bin'] = pd.qcut(train_df['CR009'],5,duplicates='drop').values.add_categories("NoData")
train_df['CR009_bin'] = train_df['CR009_bin'].fillna("NoData").astype(str)
train_df['CR009_bin'].value_counts(dropna=False)
Out[37]:
(24221.8, 50000.0]      13072
(-0.001, 2500.0]        13072
(11484.4, 24221.8]      12800
(50000.0, 1420300.0]    12528
(2500.0, 11484.4]       12528
Name: CR009_bin, dtype: int64
In [38]:
k = WOE('CR009_bin')
k
Out[38]:
CR009_bin Count Good Bad Good % Bad % CR009_bin_WOE
4 (50000.0, 1420300.0] 12528 2158 10370 17.49 20.07 -0.14
0 (-0.001, 2500.0] 13072 2338 10734 18.95 20.78 -0.09
1 (11484.4, 24221.8] 12800 2615 10185 21.19 19.71 0.07
2 (24221.8, 50000.0] 13072 2658 10414 21.54 20.16 0.07
3 (2500.0, 11484.4] 12528 2569 9959 20.82 19.28 0.08
In [39]:
#Append the WOE value of each category back to the original train data
train_df_WOE_CR009 = pd.merge(train_df, k[['CR009_bin', 'CR009_bin_WOE']],
                             left_on='CR009_bin',
                             right_on='CR009_bin', how='left')
train_df_WOE_CR009.head(10)
Out[39]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 PA029 TD001 TD005 TD006 TD009 TD010 TD014 AP003_bin CR009_bin CR009_bin_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 -98.0 5 8 3 14 5 5 (3.0, 6.0] (24221.8, 50000.0] 0.07
1 35563 1 47 1 2 0 6 12 87.0 87.0 17.5 2 2 0 2 1 1 (0.999, 3.0] (-0.001, 2500.0] -0.09
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 -98.0 2 3 1 6 2 2 (0.999, 3.0] (24221.8, 50000.0] 0.07
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 -98.0 5 9 3 9 3 3 (0.999, 3.0] (11484.4, 24221.8] 0.07
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 -98.0 2 2 0 2 0 0 (3.0, 6.0] (50000.0, 1420300.0] -0.14
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 -98.0 5 11 3 11 4 4 (0.999, 3.0] (24221.8, 50000.0] 0.07
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 -98.0 3 4 1 6 3 3 (0.999, 3.0] (-0.001, 2500.0] -0.09
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 -98.0 4 4 1 6 1 1 (0.999, 3.0] (-0.001, 2500.0] -0.09
8 56100 1 26 3 5 20799 5 5 12.0 12.0 96.0 4 9 1 10 1 2 (0.999, 3.0] (11484.4, 24221.8] 0.07
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 52.0 2 3 3 5 3 3 (0.999, 3.0] (50000.0, 1420300.0] -0.14
In [40]:
nan_check = train_df_WOE_CR009['CR009_bin_WOE'].isna()
nan_values = train_df_WOE_CR009['CR009_bin_WOE'][nan_check]
nan_values
Out[40]:
Series([], Name: CR009_bin_WOE, dtype: float64)
In [41]:
# Define the desired bin labels
bin_labels = ["(24221.8, 50000.0]", "(-0.001, 2500.0]","(11484.4, 24221.8]","(50000.0, 1420300.0]","(2500.0, 11484.4]"]
# Bin the test data with the specified labels
test_df['CR009_bin_labels'] = pd.qcut(test_df['CR009'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['CR009_bin'] = pd.qcut(test_df['CR009'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['CR009_bin'] = test_df['CR009_bin'].fillna("NoData")
# Print the value counts
test_df['CR009_bin'].value_counts(dropna=False)
Out[41]:
(24221.8, 50000.0]      3265
(50000.0, 1420300.0]    3209
(11484.4, 24221.8]      3207
(2500.0, 11484.4]       3176
(-0.001, 2500.0]        3143
Name: CR009_bin, dtype: int64
In [42]:
test_df_WOE_CR009 = pd.merge(test_df, k[['CR009_bin', 'CR009_bin_WOE']],
                             left_on='CR009_bin',
                             right_on='CR009_bin', how='left')
test_df_WOE_CR009.head(10)
Out[42]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD005 TD006 TD009 TD010 TD014 AP003_bin_labels AP003_bin CR009_bin_labels CR009_bin CR009_bin_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 ... 2 1 2 1 1 0 (0.999, 3.0] 1 (-0.001, 2500.0] -0.09
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... 4 1 7 1 2 0 (0.999, 3.0] 3 (50000.0, 1420300.0] -0.14
2 74784 0 29 4 5 33000 5 11 51.0 51.0 ... 3 1 4 1 1 1 (3.0, 6.0] 3 (50000.0, 1420300.0] -0.14
3 70976 1 28 1 5 3000 5 3 85.0 85.0 ... 1 3 1 3 4 0 (0.999, 3.0] 1 (-0.001, 2500.0] -0.09
4 46646 0 27 1 3 48219 5 11 58.0 58.0 ... 7 2 15 5 6 0 (0.999, 3.0] 3 (50000.0, 1420300.0] -0.14
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 ... 7 0 8 3 4 1 (3.0, 6.0] 1 (-0.001, 2500.0] -0.09
6 65510 0 23 3 1 8100 2 3 75.0 75.0 ... 14 6 25 8 11 0 (0.999, 3.0] 1 (-0.001, 2500.0] -0.09
7 62716 0 36 1 3 0 5 3 115.0 115.0 ... 3 0 3 1 2 0 (0.999, 3.0] 0 (24221.8, 50000.0] 0.07
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 ... 5 1 8 2 6 0 (0.999, 3.0] 2 (11484.4, 24221.8] 0.07
9 58835 0 24 3 2 60877 5 10 52.0 23.0 ... 3 1 6 2 3 0 (0.999, 3.0] 4 (2500.0, 11484.4] 0.08

10 rows × 22 columns

In [43]:
nan_check = test_df_WOE_CR009['CR009_bin_WOE'].isna()
nan_values = test_df_WOE_CR009['CR009_bin_WOE'][nan_check]
nan_values
Out[43]:
Series([], Name: CR009_bin_WOE, dtype: float64)

CR015 (MONTH_CREDIT_CARD_MOB_MAX):¶

  • This variable represents the maximum monthly outstanding balance of a credit card reported by the credit bureau. Higher values of CR015 indicate higher maximum outstanding balances on credit cards, suggesting individuals with higher credit utilization or larger balances on their credit card accounts.
In [44]:
k = WOE('CR015')
k
Out[44]:
CR015 Count Good Bad Good % Bad % CR015_WOE
4 6 21562 3337 18225 27.05 35.28 -0.27
0 2 2676 503 2173 4.08 4.21 -0.03
3 5 27500 5641 21859 45.72 42.31 0.08
2 4 5870 1278 4592 10.36 8.89 0.15
1 3 6392 1579 4813 12.80 9.32 0.32
In [45]:
#Bin the train data
train_df['CR015_bin'] = pd.qcut(train_df['CR015'],5,duplicates='drop').values.add_categories("NoData")
train_df['CR015_bin'] = train_df['CR015_bin'].fillna("NoData").astype(str)
train_df['CR015_bin'].value_counts(dropna=False)
Out[45]:
(4.0, 5.0]      27500
(5.0, 6.0]      21562
(1.999, 4.0]    14938
Name: CR015_bin, dtype: int64
In [46]:
k = WOE('CR015_bin')
k
Out[46]:
CR015_bin Count Good Bad Good % Bad % CR015_bin_WOE
2 (5.0, 6.0] 21562 3337 18225 27.05 35.28 -0.27
1 (4.0, 5.0] 27500 5641 21859 45.72 42.31 0.08
0 (1.999, 4.0] 14938 3360 11578 27.23 22.41 0.19
In [47]:
#Append the WOE value of each category back to the original train data
train_df_WOE_CR015 = pd.merge(train_df,k[['CR015_bin','CR015_bin_WOE']],
     left_on='CR015_bin',
     right_on='CR015_bin',how='left')
train_df_WOE_CR015.head(10)
Out[47]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD001 TD005 TD006 TD009 TD010 TD014 AP003_bin CR009_bin CR015_bin CR015_bin_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 ... 5 8 3 14 5 5 (3.0, 6.0] (24221.8, 50000.0] (4.0, 5.0] 0.08
1 35563 1 47 1 2 0 6 12 87.0 87.0 ... 2 2 0 2 1 1 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] -0.27
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 ... 2 3 1 6 2 2 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] 0.08
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 ... 5 9 3 9 3 3 (0.999, 3.0] (11484.4, 24221.8] (5.0, 6.0] -0.27
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 ... 2 2 0 2 0 0 (3.0, 6.0] (50000.0, 1420300.0] (5.0, 6.0] -0.27
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 ... 5 11 3 11 4 4 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] 0.08
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 ... 3 4 1 6 3 3 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] -0.27
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 ... 4 4 1 6 1 1 (0.999, 3.0] (-0.001, 2500.0] (1.999, 4.0] 0.19
8 56100 1 26 3 5 20799 5 5 12.0 12.0 ... 4 9 1 10 1 2 (0.999, 3.0] (11484.4, 24221.8] (4.0, 5.0] 0.08
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 ... 2 3 3 5 3 3 (0.999, 3.0] (50000.0, 1420300.0] (4.0, 5.0] 0.08

10 rows × 21 columns

In [48]:
nan_check = train_df_WOE_CR015['CR015_bin_WOE'].isna()
nan_values = train_df_WOE_CR015['CR015_bin_WOE'][nan_check]
nan_values
Out[48]:
Series([], Name: CR015_bin_WOE, dtype: float64)
In [49]:
# Define the desired bin labels
bin_labels = ["(4.0, 5.0]", "(5.0, 6.0]","(1.999, 4.0]"]
# Bin the test data with the specified labels
test_df['CR015_bin_labels'] = pd.qcut(test_df['CR015'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['CR015_bin'] = pd.qcut(test_df['CR015'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['CR015_bin'] = test_df['CR015_bin'].fillna("NoData")
# Print the value counts
test_df['CR015_bin'].value_counts(dropna=False)
Out[49]:
(5.0, 6.0]      6839
(1.999, 4.0]    5565
(4.0, 5.0]      3596
Name: CR015_bin, dtype: int64
In [50]:
test_df
Out[50]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD006 TD009 TD010 TD014 AP003_bin_labels AP003_bin CR009_bin_labels CR009_bin CR015_bin_labels CR015_bin
47044 47045 0 30 3 3 10000 5 5 25.0 25.0 ... 1 2 1 1 0 (0.999, 3.0] 1 (-0.001, 2500.0] 1 (5.0, 6.0]
44295 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... 1 7 1 2 0 (0.999, 3.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0]
74783 74784 0 29 4 5 33000 5 11 51.0 51.0 ... 1 4 1 1 1 (3.0, 6.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0]
70975 70976 1 28 1 5 3000 5 3 85.0 85.0 ... 3 1 3 4 0 (0.999, 3.0] 1 (-0.001, 2500.0] 1 (5.0, 6.0]
46645 46646 0 27 1 3 48219 5 11 58.0 58.0 ... 2 15 5 6 0 (0.999, 3.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
67666 67667 0 41 1 5 46967 6 11 56.0 56.0 ... 2 4 4 4 0 (0.999, 3.0] 3 (50000.0, 1420300.0] 2 (1.999, 4.0]
51146 51147 0 39 1 2 25796 6 2 91.0 91.0 ... 3 14 4 5 0 (0.999, 3.0] 3 (50000.0, 1420300.0] 2 (1.999, 4.0]
42494 42495 1 31 1 2 0 5 3 -1.0 -1.0 ... 1 3 1 2 0 (0.999, 3.0] 0 (24221.8, 50000.0] 1 (5.0, 6.0]
52517 52518 0 34 1 1 3600 3 2 -1.0 -1.0 ... 1 3 1 2 0 (0.999, 3.0] 1 (-0.001, 2500.0] 0 (4.0, 5.0]
7754 7755 0 43 3 2 52000 6 10 -1.0 -1.0 ... 1 10 3 5 0 (0.999, 3.0] 4 (2500.0, 11484.4] 2 (1.999, 4.0]

16000 rows × 23 columns

In [51]:
#Append the WOE table to the test data
test_df_WOE_CR015 = pd.merge(test_df,k[['CR015_bin','CR015_bin_WOE']],
     left_on='CR015_bin',
     right_on='CR015_bin',how='left')
test_df_WOE_CR015.head(10)
Out[51]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD009 TD010 TD014 AP003_bin_labels AP003_bin CR009_bin_labels CR009_bin CR015_bin_labels CR015_bin CR015_bin_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 ... 2 1 1 0 (0.999, 3.0] 1 (-0.001, 2500.0] 1 (5.0, 6.0] -0.27
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... 7 1 2 0 (0.999, 3.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] -0.27
2 74784 0 29 4 5 33000 5 11 51.0 51.0 ... 4 1 1 1 (3.0, 6.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] -0.27
3 70976 1 28 1 5 3000 5 3 85.0 85.0 ... 1 3 4 0 (0.999, 3.0] 1 (-0.001, 2500.0] 1 (5.0, 6.0] -0.27
4 46646 0 27 1 3 48219 5 11 58.0 58.0 ... 15 5 6 0 (0.999, 3.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] -0.27
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 ... 8 3 4 1 (3.0, 6.0] 1 (-0.001, 2500.0] 2 (1.999, 4.0] 0.19
6 65510 0 23 3 1 8100 2 3 75.0 75.0 ... 25 8 11 0 (0.999, 3.0] 1 (-0.001, 2500.0] 0 (4.0, 5.0] 0.08
7 62716 0 36 1 3 0 5 3 115.0 115.0 ... 3 1 2 0 (0.999, 3.0] 0 (24221.8, 50000.0] 1 (5.0, 6.0] -0.27
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 ... 8 2 6 0 (0.999, 3.0] 2 (11484.4, 24221.8] 1 (5.0, 6.0] -0.27
9 58835 0 24 3 2 60877 5 10 52.0 23.0 ... 6 2 3 0 (0.999, 3.0] 4 (2500.0, 11484.4] 1 (5.0, 6.0] -0.27

10 rows × 24 columns

In [52]:
nan_check = test_df_WOE_CR015['CR015_bin_WOE'].isna()
nan_values = test_df_WOE_CR015['CR015_bin_WOE'][nan_check]
nan_values
Out[52]:
Series([], Name: CR015_bin_WOE, dtype: float64)

CR019 (SCORE_SINGLE_DEBIT_CARD_LIMIT):¶

  • A numeric variable representing the credit score for a single debit card limit reported by the credit bureau.
  • This variable represents the score assigned to the maximum single debit card limit reported by the credit bureau. Higher values of SCORE_SINGLE_DEBIT_CARD_LIMIT indicate a higher assigned score to the maximum single debit card limit, suggesting a higher creditworthiness or a more favorable financial standing.
In [53]:
k = WOE('CR019')
k
Out[53]:
CR019 Count Good Bad Good % Bad % CR019_WOE
11 12 3499 564 2935 4.57 5.68 -0.22
10 11 10678 1753 8925 14.21 17.28 -0.20
8 9 2318 388 1930 3.14 3.74 -0.17
5 6 4136 744 3392 6.03 6.57 -0.09
9 10 1808 332 1476 2.69 2.86 -0.06
7 8 2615 484 2131 3.92 4.12 -0.05
6 7 5150 982 4168 7.96 8.07 -0.01
4 5 7699 1513 6186 12.26 11.97 0.02
0 1 872 182 690 1.48 1.34 0.10
2 3 10654 2263 8391 18.34 16.24 0.12
3 4 7761 1662 6099 13.47 11.81 0.13
1 2 6810 1471 5339 11.92 10.33 0.14
In [54]:
#Append the WOE value of each category back to the original train data
train_df_WOE_CR019 = pd.merge(train_df,k[['CR019','CR019_WOE']],
     left_on='CR019',
     right_on='CR019',how='left')
train_df_WOE_CR019.head(10)
Out[54]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD001 TD005 TD006 TD009 TD010 TD014 AP003_bin CR009_bin CR015_bin CR019_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 ... 5 8 3 14 5 5 (3.0, 6.0] (24221.8, 50000.0] (4.0, 5.0] 0.02
1 35563 1 47 1 2 0 6 12 87.0 87.0 ... 2 2 0 2 1 1 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] -0.22
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 ... 2 3 1 6 2 2 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] -0.22
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 ... 5 9 3 9 3 3 (0.999, 3.0] (11484.4, 24221.8] (5.0, 6.0] 0.02
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 ... 2 2 0 2 0 0 (3.0, 6.0] (50000.0, 1420300.0] (5.0, 6.0] -0.01
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 ... 5 11 3 11 4 4 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] 0.13
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 ... 3 4 1 6 3 3 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] 0.12
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 ... 4 4 1 6 1 1 (0.999, 3.0] (-0.001, 2500.0] (1.999, 4.0] 0.02
8 56100 1 26 3 5 20799 5 5 12.0 12.0 ... 4 9 1 10 1 2 (0.999, 3.0] (11484.4, 24221.8] (4.0, 5.0] 0.02
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 ... 2 3 3 5 3 3 (0.999, 3.0] (50000.0, 1420300.0] (4.0, 5.0] -0.01

10 rows × 21 columns

In [55]:
#Append the WOE table to the test data
test_df_WOE_CR019 = pd.merge(test_df,k[['CR019','CR019_WOE']],
     left_on='CR019',
     right_on='CR019',how='left')
test_df_WOE_CR019.head(10)
Out[55]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD009 TD010 TD014 AP003_bin_labels AP003_bin CR009_bin_labels CR009_bin CR015_bin_labels CR015_bin CR019_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 ... 2 1 1 0 (0.999, 3.0] 1 (-0.001, 2500.0] 1 (5.0, 6.0] 0.02
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... 7 1 2 0 (0.999, 3.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] 0.02
2 74784 0 29 4 5 33000 5 11 51.0 51.0 ... 4 1 1 1 (3.0, 6.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] -0.20
3 70976 1 28 1 5 3000 5 3 85.0 85.0 ... 1 3 4 0 (0.999, 3.0] 1 (-0.001, 2500.0] 1 (5.0, 6.0] 0.12
4 46646 0 27 1 3 48219 5 11 58.0 58.0 ... 15 5 6 0 (0.999, 3.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] -0.20
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 ... 8 3 4 1 (3.0, 6.0] 1 (-0.001, 2500.0] 2 (1.999, 4.0] -0.20
6 65510 0 23 3 1 8100 2 3 75.0 75.0 ... 25 8 11 0 (0.999, 3.0] 1 (-0.001, 2500.0] 0 (4.0, 5.0] 0.12
7 62716 0 36 1 3 0 5 3 115.0 115.0 ... 3 1 2 0 (0.999, 3.0] 0 (24221.8, 50000.0] 1 (5.0, 6.0] 0.12
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 ... 8 2 6 0 (0.999, 3.0] 2 (11484.4, 24221.8] 1 (5.0, 6.0] -0.05
9 58835 0 24 3 2 60877 5 10 52.0 23.0 ... 6 2 3 0 (0.999, 3.0] 4 (2500.0, 11484.4] 1 (5.0, 6.0] -0.06

10 rows × 24 columns

In [56]:
nan_check = test_df_WOE_CR019['CR019'].isna()
nan_values = test_df_WOE_CR019['CR019_WOE'][nan_check]
nan_values
Out[56]:
Series([], Name: CR019_WOE, dtype: float64)
In [57]:
nan_check= train_df_WOE_CR019['CR019_WOE'].isna()
nan_values = train_df_WOE_CR019['CR019_WOE'][nan_check]
nan_values
Out[57]:
Series([], Name: CR019_WOE, dtype: float64)

TD001(TD_CNT_QUERY_LAST_7Day_P2P): ¶

  • A numeric variable representing the count of queries for P2P (peer-to-peer) loans in the last 7 days.
  • In the context of predicting loan default, this feature can provide insights into a borrower's recent P2P lending activity and potentially impact their likelihood of defaulting on a loan.
In [58]:
k = WOE('TD001')
k
Out[58]:
TD001 Count Good Bad Good % Bad % TD001_WOE
0 0 15698 2455 13243 19.90 25.63 -0.25
1 1 10707 1723 8984 13.96 17.39 -0.22
16 16 6 1 5 0.01 0.01 0.00
2 2 17835 3487 14348 28.26 27.77 0.02
3 3 9755 2069 7686 16.77 14.88 0.12
4 4 4891 1163 3728 9.43 7.22 0.27
5 5 2313 614 1699 4.98 3.29 0.41
10 10 112 29 83 0.24 0.16 0.41
6 6 1267 350 917 2.84 1.77 0.47
7 7 712 199 513 1.61 0.99 0.49
12 12 36 11 25 0.09 0.05 0.59
9 9 189 61 128 0.49 0.25 0.67
8 8 364 126 238 1.02 0.46 0.80
11 11 65 25 40 0.20 0.08 0.92
15 15 8 4 4 0.03 0.01 1.10
13 13 22 10 12 0.08 0.02 1.39
14 14 12 6 6 0.05 0.01 1.61
19 19 4 3 1 0.02 0.00 inf
18 18 2 1 1 0.01 0.00 inf
20 20 1 1 0 0.01 0.00 inf
17 17 1 0 1 0.00 0.00 NaN
In [59]:
#Bin the train data
train_df['TD001_bin'] = pd.qcut(train_df['TD001'],5,duplicates='drop').values.add_categories("NoData")
train_df['TD001_bin'] = train_df['TD001_bin'].fillna("NoData").astype(str)
train_df['TD001_bin'].value_counts(dropna=False)
Out[59]:
(-0.001, 1.0]    26405
(1.0, 2.0]       17835
(3.0, 20.0]      10005
(2.0, 3.0]        9755
Name: TD001_bin, dtype: int64
In [60]:
k = WOE('TD001_bin')
k
Out[60]:
TD001_bin Count Good Bad Good % Bad % TD001_bin_WOE
0 (-0.001, 1.0] 26405 4178 22227 33.86 43.02 -0.24
1 (1.0, 2.0] 17835 3487 14348 28.26 27.77 0.02
2 (2.0, 3.0] 9755 2069 7686 16.77 14.88 0.12
3 (3.0, 20.0] 10005 2604 7401 21.11 14.33 0.39
In [61]:
#Append the WOE value of each category back to the original train data
train_df_WOE_TD001 = pd.merge(train_df,k[['TD001_bin','TD001_bin_WOE']],
     left_on='TD001_bin',
     right_on='TD001_bin',how='left')
train_df_WOE_TD001.head(10)
Out[61]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD005 TD006 TD009 TD010 TD014 AP003_bin CR009_bin CR015_bin TD001_bin TD001_bin_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 ... 8 3 14 5 5 (3.0, 6.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] 0.39
1 35563 1 47 1 2 0 6 12 87.0 87.0 ... 2 0 2 1 1 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (1.0, 2.0] 0.02
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 ... 3 1 6 2 2 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (1.0, 2.0] 0.02
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 ... 9 3 9 3 3 (0.999, 3.0] (11484.4, 24221.8] (5.0, 6.0] (3.0, 20.0] 0.39
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 ... 2 0 2 0 0 (3.0, 6.0] (50000.0, 1420300.0] (5.0, 6.0] (1.0, 2.0] 0.02
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 ... 11 3 11 4 4 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] 0.39
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 ... 4 1 6 3 3 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (2.0, 3.0] 0.12
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 ... 4 1 6 1 1 (0.999, 3.0] (-0.001, 2500.0] (1.999, 4.0] (3.0, 20.0] 0.39
8 56100 1 26 3 5 20799 5 5 12.0 12.0 ... 9 1 10 1 2 (0.999, 3.0] (11484.4, 24221.8] (4.0, 5.0] (3.0, 20.0] 0.39
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 ... 3 3 5 3 3 (0.999, 3.0] (50000.0, 1420300.0] (4.0, 5.0] (1.0, 2.0] 0.02

10 rows × 22 columns

In [62]:
nan_check= train_df_WOE_TD001['TD001_bin_WOE'].isna()
nan_values = train_df_WOE_TD001['TD001_bin_WOE'][nan_check]
nan_values
Out[62]:
Series([], Name: TD001_bin_WOE, dtype: float64)
In [63]:
# Define the desired bin labels
bin_labels = ["(-0.001, 1.0]", "(1.0, 2.0]", "(3.0, 20.0]","(2.0, 3.0]"]
# Bin the test data with the specified labels
test_df['TD001_bin_labels'] = pd.qcut(test_df['TD001'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['TD001_bin'] = pd.qcut(test_df['TD001'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['TD001_bin'] = test_df['TD001_bin'].fillna("NoData")
# Print the value counts
test_df['TD001_bin'].value_counts(dropna=False)
Out[63]:
(-0.001, 1.0]    6635
(1.0, 2.0]       4364
(2.0, 3.0]       2570
(3.0, 20.0]      2431
Name: TD001_bin, dtype: int64
In [64]:
test_df_WOE_TD001 = pd.merge(test_df, k[['TD001_bin', 'TD001_bin_WOE']],
                             left_on='TD001_bin',
                             right_on='TD001_bin', how='left')
test_df_WOE_TD001.head(10)
Out[64]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD014 AP003_bin_labels AP003_bin CR009_bin_labels CR009_bin CR015_bin_labels CR015_bin TD001_bin_labels TD001_bin TD001_bin_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 ... 1 0 (0.999, 3.0] 1 (-0.001, 2500.0] 1 (5.0, 6.0] 1 (1.0, 2.0] 0.02
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... 2 0 (0.999, 3.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] 1 (1.0, 2.0] 0.02
2 74784 0 29 4 5 33000 5 11 51.0 51.0 ... 1 1 (3.0, 6.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] 0 (-0.001, 1.0] -0.24
3 70976 1 28 1 5 3000 5 3 85.0 85.0 ... 4 0 (0.999, 3.0] 1 (-0.001, 2500.0] 1 (5.0, 6.0] 0 (-0.001, 1.0] -0.24
4 46646 0 27 1 3 48219 5 11 58.0 58.0 ... 6 0 (0.999, 3.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] 3 (2.0, 3.0] 0.12
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 ... 4 1 (3.0, 6.0] 1 (-0.001, 2500.0] 2 (1.999, 4.0] 3 (2.0, 3.0] 0.12
6 65510 0 23 3 1 8100 2 3 75.0 75.0 ... 11 0 (0.999, 3.0] 1 (-0.001, 2500.0] 0 (4.0, 5.0] 3 (2.0, 3.0] 0.12
7 62716 0 36 1 3 0 5 3 115.0 115.0 ... 2 0 (0.999, 3.0] 0 (24221.8, 50000.0] 1 (5.0, 6.0] 1 (1.0, 2.0] 0.02
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 ... 6 0 (0.999, 3.0] 2 (11484.4, 24221.8] 1 (5.0, 6.0] 1 (1.0, 2.0] 0.02
9 58835 0 24 3 2 60877 5 10 52.0 23.0 ... 3 0 (0.999, 3.0] 4 (2500.0, 11484.4] 1 (5.0, 6.0] 2 (3.0, 20.0] 0.39

10 rows × 26 columns

In [65]:
nan_check = test_df_WOE_TD001['TD001_bin_WOE'].isna()
nan_values = test_df_WOE_TD001['TD001_bin_WOE'][nan_check]
nan_values
Out[65]:
Series([], Name: TD001_bin_WOE, dtype: float64)

TD005 (TD_CNT_QUERY_LAST_1MON_P2P, Credit Center): ¶

  • This variable represents the count of P2P credit queries made in the last 1 month by the Credit Center. Higher values of TD005 indicate a greater number of P2P credit queries within the last 1 month, suggesting increased credit activity or more frequent inquiries for evaluating creditworthiness.
In [66]:
k = WOE('TD005')
k
Out[66]:
TD005 Count Good Bad Good % Bad % TD005_WOE
1 1 6735 844 5891 6.84 11.40 -0.51
0 0 6157 821 5336 6.65 10.33 -0.44
2 2 13559 2188 11371 17.73 22.01 -0.22
3 3 10995 2076 8919 16.83 17.26 -0.03
23 23 4 1 3 0.01 0.01 0.00
24 24 4 1 3 0.01 0.01 0.00
4 4 8174 1633 6541 13.24 12.66 0.04
5 5 5779 1340 4439 10.86 8.59 0.23
6 6 4081 991 3090 8.03 5.98 0.29
7 7 2835 739 2096 5.99 4.06 0.39
8 8 1928 510 1418 4.13 2.74 0.41
18 18 31 9 22 0.07 0.04 0.56
11 11 566 170 396 1.38 0.77 0.58
9 9 1285 386 899 3.13 1.74 0.59
10 10 785 244 541 1.98 1.05 0.63
13 13 254 81 173 0.66 0.33 0.69
22 22 9 3 6 0.02 0.01 0.69
21 21 10 3 7 0.02 0.01 0.69
20 20 16 5 11 0.04 0.02 0.69
12 12 348 115 233 0.93 0.45 0.73
16 16 77 29 48 0.24 0.09 0.98
19 19 24 10 14 0.08 0.03 0.98
15 15 110 43 67 0.35 0.13 0.99
17 17 59 24 35 0.19 0.07 1.00
14 14 175 72 103 0.58 0.20 1.06
In [67]:
#Append the WOE value of each category back to the original train data
train_df_WOE_TD005 = pd.merge(train_df,k[['TD005','TD005_WOE']],
     left_on='TD005',
     right_on='TD005',how='left')
train_df_WOE_TD005.head(10)
Out[67]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD005 TD006 TD009 TD010 TD014 AP003_bin CR009_bin CR015_bin TD001_bin TD005_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 ... 8 3 14 5 5 (3.0, 6.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] 0.41
1 35563 1 47 1 2 0 6 12 87.0 87.0 ... 2 0 2 1 1 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (1.0, 2.0] -0.22
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 ... 3 1 6 2 2 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (1.0, 2.0] -0.03
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 ... 9 3 9 3 3 (0.999, 3.0] (11484.4, 24221.8] (5.0, 6.0] (3.0, 20.0] 0.59
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 ... 2 0 2 0 0 (3.0, 6.0] (50000.0, 1420300.0] (5.0, 6.0] (1.0, 2.0] -0.22
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 ... 11 3 11 4 4 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] 0.58
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 ... 4 1 6 3 3 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (2.0, 3.0] 0.04
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 ... 4 1 6 1 1 (0.999, 3.0] (-0.001, 2500.0] (1.999, 4.0] (3.0, 20.0] 0.04
8 56100 1 26 3 5 20799 5 5 12.0 12.0 ... 9 1 10 1 2 (0.999, 3.0] (11484.4, 24221.8] (4.0, 5.0] (3.0, 20.0] 0.59
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 ... 3 3 5 3 3 (0.999, 3.0] (50000.0, 1420300.0] (4.0, 5.0] (1.0, 2.0] -0.03

10 rows × 22 columns

In [68]:
#Append the WOE table to the test data
test_df_WOE_TD005 = pd.merge(test_df,k[['TD005','TD005_WOE']],
     left_on='TD005',
     right_on='TD005',how='left')
test_df_WOE_TD005.head(10)
Out[68]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD014 AP003_bin_labels AP003_bin CR009_bin_labels CR009_bin CR015_bin_labels CR015_bin TD001_bin_labels TD001_bin TD005_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 ... 1 0 (0.999, 3.0] 1 (-0.001, 2500.0] 1 (5.0, 6.0] 1 (1.0, 2.0] -0.22
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... 2 0 (0.999, 3.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] 1 (1.0, 2.0] 0.04
2 74784 0 29 4 5 33000 5 11 51.0 51.0 ... 1 1 (3.0, 6.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] 0 (-0.001, 1.0] -0.03
3 70976 1 28 1 5 3000 5 3 85.0 85.0 ... 4 0 (0.999, 3.0] 1 (-0.001, 2500.0] 1 (5.0, 6.0] 0 (-0.001, 1.0] -0.51
4 46646 0 27 1 3 48219 5 11 58.0 58.0 ... 6 0 (0.999, 3.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] 3 (2.0, 3.0] 0.39
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 ... 4 1 (3.0, 6.0] 1 (-0.001, 2500.0] 2 (1.999, 4.0] 3 (2.0, 3.0] 0.39
6 65510 0 23 3 1 8100 2 3 75.0 75.0 ... 11 0 (0.999, 3.0] 1 (-0.001, 2500.0] 0 (4.0, 5.0] 3 (2.0, 3.0] 1.06
7 62716 0 36 1 3 0 5 3 115.0 115.0 ... 2 0 (0.999, 3.0] 0 (24221.8, 50000.0] 1 (5.0, 6.0] 1 (1.0, 2.0] -0.03
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 ... 6 0 (0.999, 3.0] 2 (11484.4, 24221.8] 1 (5.0, 6.0] 1 (1.0, 2.0] 0.23
9 58835 0 24 3 2 60877 5 10 52.0 23.0 ... 3 0 (0.999, 3.0] 4 (2500.0, 11484.4] 1 (5.0, 6.0] 2 (3.0, 20.0] -0.03

10 rows × 26 columns

In [69]:
nan_check = test_df_WOE_TD005['TD005_WOE'].isna()
nan_values = test_df_WOE_TD005['TD005_WOE'][nan_check]
nan_values
Out[69]:
Series([], Name: TD005_WOE, dtype: float64)
In [70]:
nan_check= train_df_WOE_TD005['TD005_WOE'].isna()
nan_values = train_df_WOE_TD005['TD005_WOE'][nan_check]
nan_values
Out[70]:
Series([], Name: TD005_WOE, dtype: float64)

TD006 (TD_CNT_QUERY_LAST_1MON_SMALL_LOAN, Credit Center): ¶

  • A numeric variable representing the count of queries for small loans in the last 1 month.
  • This variable represents the count of small loan credit queries made in the last 1 month by the Credit Center. Higher values of TD006 indicate a greater number of small loan credit queries within the last 1 month, suggesting increased credit activity or more frequent inquiries for evaluating creditworthiness.
In [71]:
k = WOE('TD006')
k
Out[71]:
TD006 Count Good Bad Good % Bad % TD006_WOE
0 0 18701 3135 15566 25.41 30.13 -0.17
1 1 23081 4027 19054 32.64 36.88 -0.12
14 14 6 1 5 0.01 0.01 0.00
13 13 12 3 9 0.02 0.02 0.00
2 2 12417 2610 9807 21.15 18.98 0.11
3 3 5461 1299 4162 10.53 8.06 0.27
4 4 2292 624 1668 5.06 3.23 0.45
5 5 1014 295 719 2.39 1.39 0.54
7 7 242 72 170 0.58 0.33 0.56
6 6 464 151 313 1.22 0.61 0.69
10 10 47 16 31 0.13 0.06 0.77
8 8 127 44 83 0.36 0.16 0.81
11 11 26 11 15 0.09 0.03 1.10
9 9 86 40 46 0.32 0.09 1.27
12 12 13 6 7 0.05 0.01 1.61
18 18 3 1 2 0.01 0.00 inf
17 17 4 2 2 0.02 0.00 inf
20 21 1 1 0 0.01 0.00 inf
15 15 1 0 1 0.00 0.00 NaN
16 16 1 0 1 0.00 0.00 NaN
19 20 1 0 1 0.00 0.00 NaN
In [72]:
#Bin the train data
train_df['TD006_bin'] = pd.qcut(train_df['TD006'],5,duplicates='drop').values.add_categories("NoData")
train_df['TD006_bin'] = train_df['TD006_bin'].fillna("NoData").astype(str)
train_df['TD006_bin'].value_counts(dropna=False)
Out[72]:
(-0.001, 1.0]    41782
(1.0, 2.0]       12417
(2.0, 21.0]       9801
Name: TD006_bin, dtype: int64
In [73]:
k = WOE('TD006_bin')
k
Out[73]:
TD006_bin Count Good Bad Good % Bad % TD006_bin_WOE
0 (-0.001, 1.0] 41782 7162 34620 58.05 67.01 -0.14
1 (1.0, 2.0] 12417 2610 9807 21.15 18.98 0.11
2 (2.0, 21.0] 9801 2566 7235 20.80 14.00 0.40
In [74]:
#Append the WOE value of each category back to the original train data
train_df_WOE_TD006 = pd.merge(train_df,k[['TD006_bin','TD006_bin_WOE']],
     left_on='TD006_bin',
     right_on='TD006_bin',how='left')
train_df_WOE_TD006.head(10)
Out[74]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD006 TD009 TD010 TD014 AP003_bin CR009_bin CR015_bin TD001_bin TD006_bin TD006_bin_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 ... 3 14 5 5 (3.0, 6.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] 0.40
1 35563 1 47 1 2 0 6 12 87.0 87.0 ... 0 2 1 1 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] -0.14
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 ... 1 6 2 2 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (1.0, 2.0] (-0.001, 1.0] -0.14
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 ... 3 9 3 3 (0.999, 3.0] (11484.4, 24221.8] (5.0, 6.0] (3.0, 20.0] (2.0, 21.0] 0.40
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 ... 0 2 0 0 (3.0, 6.0] (50000.0, 1420300.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] -0.14
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 ... 3 11 4 4 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] 0.40
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 ... 1 6 3 3 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (2.0, 3.0] (-0.001, 1.0] -0.14
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 ... 1 6 1 1 (0.999, 3.0] (-0.001, 2500.0] (1.999, 4.0] (3.0, 20.0] (-0.001, 1.0] -0.14
8 56100 1 26 3 5 20799 5 5 12.0 12.0 ... 1 10 1 2 (0.999, 3.0] (11484.4, 24221.8] (4.0, 5.0] (3.0, 20.0] (-0.001, 1.0] -0.14
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 ... 3 5 3 3 (0.999, 3.0] (50000.0, 1420300.0] (4.0, 5.0] (1.0, 2.0] (2.0, 21.0] 0.40

10 rows × 23 columns

In [75]:
nan_check= train_df_WOE_TD006['TD006_bin_WOE'].isna()
nan_values = train_df_WOE_TD006['TD006_bin_WOE'][nan_check]
nan_values
Out[75]:
Series([], Name: TD006_bin_WOE, dtype: float64)
In [76]:
# Define the desired bin labels
bin_labels = ["(-0.001, 1.0]", "(1.0, 2.0]","(2.0, 21.0]"]
# Bin the test data with the specified labels
test_df['TD006_bin_labels'] = pd.qcut(test_df['TD006'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['TD006_bin'] = pd.qcut(test_df['TD006'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['TD006_bin'] = test_df['TD006_bin'].fillna("NoData")
# Print the value counts
test_df['TD006_bin'].value_counts(dropna=False)
Out[76]:
(-0.001, 1.0]    10475
(1.0, 2.0]        3110
(2.0, 21.0]       2415
Name: TD006_bin, dtype: int64
In [77]:
test_df_WOE_TD006 = pd.merge(test_df, k[['TD006_bin', 'TD006_bin_WOE']],
                             left_on='TD006_bin',
                             right_on='TD006_bin', how='left')
test_df_WOE_TD006.head(10)
Out[77]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... AP003_bin CR009_bin_labels CR009_bin CR015_bin_labels CR015_bin TD001_bin_labels TD001_bin TD006_bin_labels TD006_bin TD006_bin_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 ... (0.999, 3.0] 1 (-0.001, 2500.0] 1 (5.0, 6.0] 1 (1.0, 2.0] 0 (-0.001, 1.0] -0.14
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... (0.999, 3.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] 1 (1.0, 2.0] 0 (-0.001, 1.0] -0.14
2 74784 0 29 4 5 33000 5 11 51.0 51.0 ... (3.0, 6.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] -0.14
3 70976 1 28 1 5 3000 5 3 85.0 85.0 ... (0.999, 3.0] 1 (-0.001, 2500.0] 1 (5.0, 6.0] 0 (-0.001, 1.0] 2 (2.0, 21.0] 0.40
4 46646 0 27 1 3 48219 5 11 58.0 58.0 ... (0.999, 3.0] 3 (50000.0, 1420300.0] 1 (5.0, 6.0] 3 (2.0, 3.0] 1 (1.0, 2.0] 0.11
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 ... (3.0, 6.0] 1 (-0.001, 2500.0] 2 (1.999, 4.0] 3 (2.0, 3.0] 0 (-0.001, 1.0] -0.14
6 65510 0 23 3 1 8100 2 3 75.0 75.0 ... (0.999, 3.0] 1 (-0.001, 2500.0] 0 (4.0, 5.0] 3 (2.0, 3.0] 2 (2.0, 21.0] 0.40
7 62716 0 36 1 3 0 5 3 115.0 115.0 ... (0.999, 3.0] 0 (24221.8, 50000.0] 1 (5.0, 6.0] 1 (1.0, 2.0] 0 (-0.001, 1.0] -0.14
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 ... (0.999, 3.0] 2 (11484.4, 24221.8] 1 (5.0, 6.0] 1 (1.0, 2.0] 0 (-0.001, 1.0] -0.14
9 58835 0 24 3 2 60877 5 10 52.0 23.0 ... (0.999, 3.0] 4 (2500.0, 11484.4] 1 (5.0, 6.0] 2 (3.0, 20.0] 0 (-0.001, 1.0] -0.14

10 rows × 28 columns

In [78]:
nan_check = test_df_WOE_TD006['TD006_bin_WOE'].isna()
nan_values = test_df_WOE_TD006['TD006_bin_WOE'][nan_check]
nan_values
Out[78]:
Series([], Name: TD006_bin_WOE, dtype: float64)

TD009 (TD_CNT_QUERY_LAST_3MON_P2P): ¶

  • A numeric variable representing the count of queries for P2P loans in the last 3 months.
In [79]:
k = WOE('TD009')
k
Out[79]:
TD009 Count Good Bad Good % Bad % TD009_WOE
0 0 2951 340 2611 2.76 5.05 -0.60
1 1 3838 443 3395 3.59 6.57 -0.60
2 2 9131 1255 7876 10.17 15.25 -0.41
3 3 8864 1388 7476 11.25 14.47 -0.25
4 4 7712 1367 6345 11.08 12.28 -0.10
27 27 13 2 11 0.02 0.02 0.00
31 31 4 1 3 0.01 0.01 0.00
5 5 6316 1254 5062 10.16 9.80 0.04
6 6 5198 1083 4115 8.78 7.97 0.10
7 7 4339 959 3380 7.77 6.54 0.17
8 8 3458 818 2640 6.63 5.11 0.26
23 23 53 14 39 0.11 0.08 0.32
9 9 2941 746 2195 6.05 4.25 0.35
16 16 367 95 272 0.77 0.53 0.37
13 13 1010 265 745 2.15 1.44 0.40
10 10 2280 611 1669 4.95 3.23 0.43
11 11 1711 461 1250 3.74 2.42 0.44
18 18 206 57 149 0.46 0.29 0.46
12 12 1420 419 1001 3.40 1.94 0.56
14 14 803 256 547 2.07 1.06 0.67
17 17 296 95 201 0.77 0.39 0.68
29 29 5 2 3 0.02 0.01 0.69
30 30 7 3 4 0.02 0.01 0.69
19 19 161 54 107 0.44 0.21 0.74
15 15 529 183 346 1.48 0.67 0.79
25 25 42 16 26 0.13 0.05 0.96
20 20 119 48 71 0.39 0.14 1.02
26 26 24 11 13 0.09 0.03 1.10
21 21 91 38 53 0.31 0.10 1.13
22 22 58 26 32 0.21 0.06 1.25
24 24 28 15 13 0.12 0.03 1.39
28 28 14 8 6 0.06 0.01 1.79
34 34 3 2 1 0.02 0.00 inf
32 32 3 2 1 0.02 0.00 inf
37 46 1 1 0 0.01 0.00 inf
33 33 2 0 2 0.00 0.00 NaN
35 36 1 0 1 0.00 0.00 NaN
36 38 1 0 1 0.00 0.00 NaN
In [80]:
#Bin the train data
train_df['TD009_bin'] = pd.qcut(train_df['TD009'],5,duplicates='drop').values.add_categories("NoData")
train_df['TD009_bin'] = train_df['TD009_bin'].fillna("NoData").astype(str)
train_df['TD009_bin'].value_counts(dropna=False)
Out[80]:
(2.0, 4.0]       16576
(-0.001, 2.0]    15920
(5.0, 8.0]       12995
(8.0, 46.0]      12193
(4.0, 5.0]        6316
Name: TD009_bin, dtype: int64
In [81]:
k = WOE('TD009_bin')
k
Out[81]:
TD009_bin Count Good Bad Good % Bad % TD009_bin_WOE
0 (-0.001, 2.0] 15920 2038 13882 16.52 26.87 -0.49
1 (2.0, 4.0] 16576 2755 13821 22.33 26.75 -0.18
2 (4.0, 5.0] 6316 1254 5062 10.16 9.80 0.04
3 (5.0, 8.0] 12995 2860 10135 23.18 19.62 0.17
4 (8.0, 46.0] 12193 3431 8762 27.81 16.96 0.49
In [82]:
#Append the WOE value of each category back to the original train data
train_df_WOE_TD009 = pd.merge(train_df,k[['TD009_bin','TD009_bin_WOE']],
     left_on='TD009_bin',
     right_on='TD009_bin',how='left')
train_df_WOE_TD009.head(10)
Out[82]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD009 TD010 TD014 AP003_bin CR009_bin CR015_bin TD001_bin TD006_bin TD009_bin TD009_bin_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 ... 14 5 5 (3.0, 6.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] 0.49
1 35563 1 47 1 2 0 6 12 87.0 87.0 ... 2 1 1 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] -0.49
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 ... 6 2 2 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (1.0, 2.0] (-0.001, 1.0] (5.0, 8.0] 0.17
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 ... 9 3 3 (0.999, 3.0] (11484.4, 24221.8] (5.0, 6.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] 0.49
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 ... 2 0 0 (3.0, 6.0] (50000.0, 1420300.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] -0.49
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 ... 11 4 4 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] 0.49
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 ... 6 3 3 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (2.0, 3.0] (-0.001, 1.0] (5.0, 8.0] 0.17
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 ... 6 1 1 (0.999, 3.0] (-0.001, 2500.0] (1.999, 4.0] (3.0, 20.0] (-0.001, 1.0] (5.0, 8.0] 0.17
8 56100 1 26 3 5 20799 5 5 12.0 12.0 ... 10 1 2 (0.999, 3.0] (11484.4, 24221.8] (4.0, 5.0] (3.0, 20.0] (-0.001, 1.0] (8.0, 46.0] 0.49
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 ... 5 3 3 (0.999, 3.0] (50000.0, 1420300.0] (4.0, 5.0] (1.0, 2.0] (2.0, 21.0] (4.0, 5.0] 0.04

10 rows × 24 columns

In [83]:
nan_check= train_df_WOE_TD009['TD009_bin_WOE'].isna()
nan_values = train_df_WOE_TD009['TD009_bin_WOE'][nan_check]
nan_values
Out[83]:
Series([], Name: TD009_bin_WOE, dtype: float64)
In [84]:
# Define the desired bin labels
bin_labels = ["(2.0, 4.0]", "(-0.001, 2.0]", "(5.0, 8.0]", "(8.0, 46.0]", "(4.0, 5.0]"]
# Bin the test data with the specified labels
test_df['TD009_bin_labels'] = pd.qcut(test_df['TD009'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['TD009_bin'] = pd.qcut(test_df['TD009'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['TD009_bin'] = test_df['TD009_bin'].fillna("NoData")
# Print the value counts
test_df['TD009_bin'].value_counts(dropna=False)
Out[84]:
(-0.001, 2.0]    4146
(2.0, 4.0]       4078
(8.0, 46.0]      3150
(4.0, 5.0]       3067
(5.0, 8.0]       1559
Name: TD009_bin, dtype: int64
In [85]:
#Append the WOE table to the test data
test_df_WOE_TD009 = pd.merge(test_df,k[['TD009_bin','TD009_bin_WOE']],
     left_on='TD009_bin',
     right_on='TD009_bin',how='left')
test_df_WOE_TD009.head(10)
Out[85]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... CR009_bin CR015_bin_labels CR015_bin TD001_bin_labels TD001_bin TD006_bin_labels TD006_bin TD009_bin_labels TD009_bin TD009_bin_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 ... (-0.001, 2500.0] 1 (5.0, 6.0] 1 (1.0, 2.0] 0 (-0.001, 1.0] 0 (2.0, 4.0] -0.18
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... (50000.0, 1420300.0] 1 (5.0, 6.0] 1 (1.0, 2.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 0.49
2 74784 0 29 4 5 33000 5 11 51.0 51.0 ... (50000.0, 1420300.0] 1 (5.0, 6.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] 1 (-0.001, 2.0] -0.49
3 70976 1 28 1 5 3000 5 3 85.0 85.0 ... (-0.001, 2500.0] 1 (5.0, 6.0] 0 (-0.001, 1.0] 2 (2.0, 21.0] 0 (2.0, 4.0] -0.18
4 46646 0 27 1 3 48219 5 11 58.0 58.0 ... (50000.0, 1420300.0] 1 (5.0, 6.0] 3 (2.0, 3.0] 1 (1.0, 2.0] 4 (4.0, 5.0] 0.04
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 ... (-0.001, 2500.0] 2 (1.999, 4.0] 3 (2.0, 3.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 0.49
6 65510 0 23 3 1 8100 2 3 75.0 75.0 ... (-0.001, 2500.0] 0 (4.0, 5.0] 3 (2.0, 3.0] 2 (2.0, 21.0] 4 (4.0, 5.0] 0.04
7 62716 0 36 1 3 0 5 3 115.0 115.0 ... (24221.8, 50000.0] 1 (5.0, 6.0] 1 (1.0, 2.0] 0 (-0.001, 1.0] 1 (-0.001, 2.0] -0.49
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 ... (11484.4, 24221.8] 1 (5.0, 6.0] 1 (1.0, 2.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 0.49
9 58835 0 24 3 2 60877 5 10 52.0 23.0 ... (2500.0, 11484.4] 1 (5.0, 6.0] 2 (3.0, 20.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 0.49

10 rows × 30 columns

In [86]:
nan_check = test_df_WOE_TD009['TD009_bin_WOE'].isna()
nan_values = test_df_WOE_TD009['TD009_bin_WOE'][nan_check]
nan_values
Out[86]:
Series([], Name: TD009_bin_WOE, dtype: float64)

TD010 (TD_CNT_QUERY_LAST_3MON_SMALL_LOAN): ¶

  • A numeric variable representing the count of queries for small loans in the last 3 months.
  • It provides insights into the borrower's recent engagement with small loan products and can contribute to predicting loan default.
In [87]:
k = WOE('TD010')
k
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
Out[87]:
TD010 Count Good Bad Good % Bad % TD010_WOE
19 19 4 0 4 0.00 0.01 -inf
0 0 12378 1879 10499 15.23 20.32 -0.29
1 1 18591 3008 15583 24.38 30.16 -0.21
2 2 13916 2689 11227 21.79 21.73 0.00
3 3 8472 1859 6613 15.07 12.80 0.16
4 4 4696 1161 3535 9.41 6.84 0.32
5 5 2601 718 1883 5.82 3.64 0.47
6 6 1406 394 1012 3.19 1.96 0.49
7 7 772 230 542 1.86 1.05 0.57
8 8 430 128 302 1.04 0.58 0.58
22 22 6 3 3 0.02 0.01 0.69
12 12 63 20 43 0.16 0.08 0.69
9 9 232 78 154 0.63 0.30 0.74
13 13 55 20 35 0.16 0.07 0.83
11 11 112 39 73 0.32 0.14 0.83
10 10 147 54 93 0.44 0.18 0.89
15 15 19 8 11 0.06 0.02 1.10
18 18 9 4 5 0.03 0.01 1.10
14 14 49 22 27 0.18 0.05 1.28
16 16 13 7 6 0.06 0.01 1.79
17 17 12 8 4 0.06 0.01 1.79
20 20 3 1 2 0.01 0.00 inf
21 21 3 2 1 0.02 0.00 inf
24 24 4 3 1 0.02 0.00 inf
25 25 2 1 1 0.01 0.00 inf
28 30 1 1 0 0.01 0.00 inf
29 35 1 1 0 0.01 0.00 inf
23 23 1 0 1 0.00 0.00 NaN
26 26 1 0 1 0.00 0.00 NaN
27 28 1 0 1 0.00 0.00 NaN
In [88]:
#Bin the train data
train_df['TD010_bin'] = pd.qcut(train_df['TD010'],5,duplicates='drop').values.add_categories("NoData")
train_df['TD010_bin'] = train_df['TD010_bin'].fillna("NoData").astype(str)
train_df['TD010_bin'].value_counts(dropna=False)
Out[88]:
(-0.001, 1.0]    30969
(1.0, 2.0]       13916
(3.0, 35.0]      10643
(2.0, 3.0]        8472
Name: TD010_bin, dtype: int64
In [89]:
k = WOE('TD010_bin')
k
Out[89]:
TD010_bin Count Good Bad Good % Bad % TD010_bin_WOE
0 (-0.001, 1.0] 30969 4887 26082 39.61 50.49 -0.24
1 (1.0, 2.0] 13916 2689 11227 21.79 21.73 0.00
2 (2.0, 3.0] 8472 1859 6613 15.07 12.80 0.16
3 (3.0, 35.0] 10643 2903 7740 23.53 14.98 0.45
In [90]:
#Append the WOE value of each category back to the original train data
train_df_WOE_TD010 = pd.merge(train_df,k[['TD010_bin','TD010_bin_WOE']],
     left_on='TD010_bin',
     right_on='TD010_bin',how='left')
train_df_WOE_TD010.head(10)
Out[90]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD010 TD014 AP003_bin CR009_bin CR015_bin TD001_bin TD006_bin TD009_bin TD010_bin TD010_bin_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 ... 5 5 (3.0, 6.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (3.0, 35.0] 0.45
1 35563 1 47 1 2 0 6 12 87.0 87.0 ... 1 1 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] (-0.001, 1.0] -0.24
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 ... 2 2 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (1.0, 2.0] (-0.001, 1.0] (5.0, 8.0] (1.0, 2.0] 0.00
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 ... 3 3 (0.999, 3.0] (11484.4, 24221.8] (5.0, 6.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (2.0, 3.0] 0.16
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 ... 0 0 (3.0, 6.0] (50000.0, 1420300.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] (-0.001, 1.0] -0.24
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 ... 4 4 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (3.0, 35.0] 0.45
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 ... 3 3 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (2.0, 3.0] (-0.001, 1.0] (5.0, 8.0] (2.0, 3.0] 0.16
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 ... 1 1 (0.999, 3.0] (-0.001, 2500.0] (1.999, 4.0] (3.0, 20.0] (-0.001, 1.0] (5.0, 8.0] (-0.001, 1.0] -0.24
8 56100 1 26 3 5 20799 5 5 12.0 12.0 ... 1 2 (0.999, 3.0] (11484.4, 24221.8] (4.0, 5.0] (3.0, 20.0] (-0.001, 1.0] (8.0, 46.0] (-0.001, 1.0] -0.24
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 ... 3 3 (0.999, 3.0] (50000.0, 1420300.0] (4.0, 5.0] (1.0, 2.0] (2.0, 21.0] (4.0, 5.0] (2.0, 3.0] 0.16

10 rows × 25 columns

In [91]:
# Define the desired bin labels
bin_labels = ["(-0.001, 1.0]", "(1.0, 2.0]", "(3.0, 35.0]", "(2.0, 3.0]"]
# Bin the test data with the specified labels
test_df['TD010_bin_labels'] = pd.qcut(test_df['TD010'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['TD010_bin'] = pd.qcut(test_df['TD010'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['TD010_bin'] = test_df['TD010_bin'].fillna("NoData")
# Print the value counts
test_df['TD010_bin'].value_counts(dropna=False)
Out[91]:
(-0.001, 1.0]    7789
(1.0, 2.0]       3472
(2.0, 3.0]       2665
(3.0, 35.0]      2074
Name: TD010_bin, dtype: int64
In [92]:
#Append the WOE table to the test data
test_df_WOE_TD010 = pd.merge(test_df,k[['TD010_bin','TD010_bin_WOE']],
     left_on='TD010_bin',
     right_on='TD010_bin',how='left')
test_df_WOE_TD010.head(10)
Out[92]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... CR015_bin TD001_bin_labels TD001_bin TD006_bin_labels TD006_bin TD009_bin_labels TD009_bin TD010_bin_labels TD010_bin TD010_bin_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 ... (5.0, 6.0] 1 (1.0, 2.0] 0 (-0.001, 1.0] 0 (2.0, 4.0] 0 (-0.001, 1.0] -0.24
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... (5.0, 6.0] 1 (1.0, 2.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 0 (-0.001, 1.0] -0.24
2 74784 0 29 4 5 33000 5 11 51.0 51.0 ... (5.0, 6.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] 1 (-0.001, 2.0] 0 (-0.001, 1.0] -0.24
3 70976 1 28 1 5 3000 5 3 85.0 85.0 ... (5.0, 6.0] 0 (-0.001, 1.0] 2 (2.0, 21.0] 0 (2.0, 4.0] 2 (3.0, 35.0] 0.45
4 46646 0 27 1 3 48219 5 11 58.0 58.0 ... (5.0, 6.0] 3 (2.0, 3.0] 1 (1.0, 2.0] 4 (4.0, 5.0] 3 (2.0, 3.0] 0.16
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 ... (1.999, 4.0] 3 (2.0, 3.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 2 (3.0, 35.0] 0.45
6 65510 0 23 3 1 8100 2 3 75.0 75.0 ... (4.0, 5.0] 3 (2.0, 3.0] 2 (2.0, 21.0] 4 (4.0, 5.0] 3 (2.0, 3.0] 0.16
7 62716 0 36 1 3 0 5 3 115.0 115.0 ... (5.0, 6.0] 1 (1.0, 2.0] 0 (-0.001, 1.0] 1 (-0.001, 2.0] 0 (-0.001, 1.0] -0.24
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 ... (5.0, 6.0] 1 (1.0, 2.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 1 (1.0, 2.0] 0.00
9 58835 0 24 3 2 60877 5 10 52.0 23.0 ... (5.0, 6.0] 2 (3.0, 20.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 1 (1.0, 2.0] 0.00

10 rows × 32 columns

In [93]:
nan_check = train_df_WOE_TD010['TD010_bin_WOE'].isna()
nan_values = train_df_WOE_TD010['TD010_bin_WOE'][nan_check]
nan_values
Out[93]:
Series([], Name: TD010_bin_WOE, dtype: float64)
In [94]:
nan_check = test_df_WOE_TD010['TD010_bin_WOE'].isna()
nan_values = test_df_WOE_TD010['TD010_bin_WOE'][nan_check]
nan_values
Out[94]:
Series([], Name: TD010_bin_WOE, dtype: float64)

TD014 (TD_CNT_QUERY_LAST_6MON_SMALL_LOAN): ¶

  • A numeric variable representing the count of queries for small loans in the last 6 months.
In [95]:
k = WOE('TD014')
k
Out[95]:
TD014 Count Good Bad Good % Bad % TD014_WOE
0 0 9486 1357 8129 11.00 15.73 -0.36
1 1 15573 2400 13173 19.45 25.50 -0.27
2 2 13366 2407 10959 19.51 21.21 -0.08
3 3 9156 1856 7300 15.04 14.13 0.06
4 4 5967 1414 4553 11.46 8.81 0.26
18 18 21 5 16 0.04 0.03 0.29
5 5 3755 941 2814 7.63 5.45 0.34
6 6 2332 625 1707 5.07 3.30 0.43
7 7 1465 422 1043 3.42 2.02 0.53
10 10 408 117 291 0.95 0.56 0.53
13 13 120 35 85 0.28 0.16 0.56
9 9 630 188 442 1.52 0.86 0.57
8 8 938 288 650 2.33 1.26 0.61
11 11 274 88 186 0.71 0.36 0.68
12 12 182 60 122 0.49 0.24 0.71
14 14 103 36 67 0.29 0.13 0.80
19 19 16 6 10 0.05 0.02 0.92
20 20 17 6 11 0.05 0.02 0.92
16 16 50 20 30 0.16 0.06 0.98
21 21 11 4 7 0.03 0.01 1.10
15 15 60 26 34 0.21 0.07 1.10
17 17 35 18 17 0.15 0.03 1.61
31 36 1 1 0 0.01 0.00 inf
22 22 9 7 2 0.06 0.00 inf
23 23 4 2 2 0.02 0.00 inf
24 24 5 3 2 0.02 0.00 inf
25 25 4 2 2 0.02 0.00 inf
26 26 2 1 1 0.01 0.00 inf
27 28 4 2 2 0.02 0.00 inf
32 43 1 1 0 0.01 0.00 inf
28 30 1 0 1 0.00 0.00 NaN
29 31 2 0 2 0.00 0.00 NaN
30 32 2 0 2 0.00 0.00 NaN
In [96]:
#Bin the train data
train_df['TD014_bin'] = pd.qcut(train_df['TD014'],5,duplicates='drop').values.add_categories("NoData")
train_df['TD014_bin'] = train_df['TD014_bin'].fillna("NoData").astype(str)
train_df['TD014_bin'].value_counts(dropna=False)
Out[96]:
(-0.001, 1.0]    25059
(2.0, 4.0]       15123
(1.0, 2.0]       13366
(4.0, 43.0]      10452
Name: TD014_bin, dtype: int64
In [97]:
k = WOE('TD014_bin')
k
Out[97]:
TD014_bin Count Good Bad Good % Bad % TD014_bin_WOE
0 (-0.001, 1.0] 25059 3757 21302 30.45 41.23 -0.30
1 (1.0, 2.0] 13366 2407 10959 19.51 21.21 -0.08
2 (2.0, 4.0] 15123 3270 11853 26.50 22.94 0.14
3 (4.0, 43.0] 10452 2904 7548 23.54 14.61 0.48
In [98]:
#Append the WOE value of each category back to the original train data
train_df_WOE_TD014 = pd.merge(train_df,k[['TD014_bin','TD014_bin_WOE']],
     left_on='TD014_bin',
     right_on='TD014_bin',how='left')
train_df_WOE_TD014.head(10)
Out[98]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD014 AP003_bin CR009_bin CR015_bin TD001_bin TD006_bin TD009_bin TD010_bin TD014_bin TD014_bin_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 ... 5 (3.0, 6.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (3.0, 35.0] (4.0, 43.0] 0.48
1 35563 1 47 1 2 0 6 12 87.0 87.0 ... 1 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] (-0.001, 1.0] (-0.001, 1.0] -0.30
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 ... 2 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (1.0, 2.0] (-0.001, 1.0] (5.0, 8.0] (1.0, 2.0] (1.0, 2.0] -0.08
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 ... 3 (0.999, 3.0] (11484.4, 24221.8] (5.0, 6.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (2.0, 3.0] (2.0, 4.0] 0.14
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 ... 0 (3.0, 6.0] (50000.0, 1420300.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] (-0.001, 1.0] (-0.001, 1.0] -0.30
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 ... 4 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (3.0, 35.0] (2.0, 4.0] 0.14
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 ... 3 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (2.0, 3.0] (-0.001, 1.0] (5.0, 8.0] (2.0, 3.0] (2.0, 4.0] 0.14
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 ... 1 (0.999, 3.0] (-0.001, 2500.0] (1.999, 4.0] (3.0, 20.0] (-0.001, 1.0] (5.0, 8.0] (-0.001, 1.0] (-0.001, 1.0] -0.30
8 56100 1 26 3 5 20799 5 5 12.0 12.0 ... 2 (0.999, 3.0] (11484.4, 24221.8] (4.0, 5.0] (3.0, 20.0] (-0.001, 1.0] (8.0, 46.0] (-0.001, 1.0] (1.0, 2.0] -0.08
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 ... 3 (0.999, 3.0] (50000.0, 1420300.0] (4.0, 5.0] (1.0, 2.0] (2.0, 21.0] (4.0, 5.0] (2.0, 3.0] (2.0, 4.0] 0.14

10 rows × 26 columns

In [99]:
nan_check = train_df_WOE_TD014['TD014_bin_WOE'].isna()
nan_values = train_df_WOE_TD014['TD014_bin_WOE'][nan_check]
nan_values
Out[99]:
Series([], Name: TD014_bin_WOE, dtype: float64)
In [100]:
# Define the desired bin labels
bin_labels = ["(-0.001, 1.0]", "(2.0, 4.0]", "(1.0, 2.0]", "(4.0, 43.0]"]
# Bin the test data with the specified labels
test_df['TD014_bin_labels'] = pd.qcut(test_df['TD014'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['TD014_bin'] = pd.qcut(test_df['TD014'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['TD014_bin'] = test_df['TD014_bin'].fillna("NoData")
# Print the value counts
test_df['TD014_bin'].value_counts(dropna=False)
Out[100]:
(-0.001, 1.0]    6323
(1.0, 2.0]       3809
(2.0, 4.0]       3279
(4.0, 43.0]      2589
Name: TD014_bin, dtype: int64
In [101]:
#Append the WOE table to the test data
test_df_WOE_TD014 = pd.merge(test_df,k[['TD014_bin','TD014_bin_WOE']],
     left_on='TD014_bin',
     right_on='TD014_bin',how='left')
test_df_WOE_TD014.head(10)
Out[101]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD001_bin TD006_bin_labels TD006_bin TD009_bin_labels TD009_bin TD010_bin_labels TD010_bin TD014_bin_labels TD014_bin TD014_bin_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 ... (1.0, 2.0] 0 (-0.001, 1.0] 0 (2.0, 4.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] -0.30
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... (1.0, 2.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 0 (-0.001, 1.0] 1 (2.0, 4.0] 0.14
2 74784 0 29 4 5 33000 5 11 51.0 51.0 ... (-0.001, 1.0] 0 (-0.001, 1.0] 1 (-0.001, 2.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] -0.30
3 70976 1 28 1 5 3000 5 3 85.0 85.0 ... (-0.001, 1.0] 2 (2.0, 21.0] 0 (2.0, 4.0] 2 (3.0, 35.0] 2 (1.0, 2.0] -0.08
4 46646 0 27 1 3 48219 5 11 58.0 58.0 ... (2.0, 3.0] 1 (1.0, 2.0] 4 (4.0, 5.0] 3 (2.0, 3.0] 3 (4.0, 43.0] 0.48
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 ... (2.0, 3.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 2 (3.0, 35.0] 2 (1.0, 2.0] -0.08
6 65510 0 23 3 1 8100 2 3 75.0 75.0 ... (2.0, 3.0] 2 (2.0, 21.0] 4 (4.0, 5.0] 3 (2.0, 3.0] 3 (4.0, 43.0] 0.48
7 62716 0 36 1 3 0 5 3 115.0 115.0 ... (1.0, 2.0] 0 (-0.001, 1.0] 1 (-0.001, 2.0] 0 (-0.001, 1.0] 1 (2.0, 4.0] 0.14
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 ... (1.0, 2.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 1 (1.0, 2.0] 3 (4.0, 43.0] 0.48
9 58835 0 24 3 2 60877 5 10 52.0 23.0 ... (3.0, 20.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 1 (1.0, 2.0] 2 (1.0, 2.0] -0.08

10 rows × 34 columns

In [102]:
nan_check = test_df_WOE_TD014['TD014_bin_WOE'].isna()
nan_values = test_df_WOE_TD014['TD014_bin_WOE'][nan_check]
nan_values
Out[102]:
Series([], Name: TD014_bin_WOE, dtype: float64)

PA022 (DAYS_BTW_APPLICATION_AND_FIRST_COLLECTION_OR_HIGH_RISK_CALL, Call Detail): ¶

  • A numeric variable representing the number of days between the application and the first collection or high-risk call.
  • It provides insights into the time lapse between applying for the loan and the initiation of collection activities or high-risk calls. A longer duration may indicate delayed or prolonged collection efforts, potentially impacting the loan default risk.
In [103]:
k = WOE('PA022')
k
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
Out[103]:
PA022 Count Good Bad Good % Bad % PA022_WOE
125 123.0 4 0 4 0.00 0.01 -inf
123 121.0 10 1 9 0.01 0.02 -0.69
53 51.0 177 23 154 0.19 0.30 -0.46
0 -99.0 1196 179 1017 1.45 1.97 -0.31
74 72.0 195 29 166 0.24 0.32 -0.29
... ... ... ... ... ... ... ...
155 426.0 1 0 1 0.00 0.00 NaN
157 437.0 1 0 1 0.00 0.00 NaN
158 440.0 1 0 1 0.00 0.00 NaN
159 441.0 1 0 1 0.00 0.00 NaN
161 448.0 1 0 1 0.00 0.00 NaN

163 rows × 7 columns

In [104]:
#Bin the train data
#Convert the 'PA022' column to numeric values, and any non-numeric values (including 'NoData') will be replaced with NaN using the errors='coerce'
train_df['PA022'] = pd.to_numeric(train_df['PA022'], errors='coerce')
train_df['PA022_bin'] = pd.qcut(train_df['PA022'],5,duplicates='drop').values.add_categories("NoData")
train_df['PA022_bin'] = train_df['PA022_bin'].fillna("NoData").astype(str)
train_df['PA022_bin'].value_counts(dropna=False)
Out[104]:
(-99.001, -1.0]    41766
(59.0, 448.0]      12644
(-1.0, 59.0]        9278
NoData               312
Name: PA022_bin, dtype: int64
In [105]:
train_df
Out[105]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD014 AP003_bin CR009_bin CR015_bin TD001_bin TD006_bin TD009_bin TD010_bin TD014_bin PA022_bin
3822 3823 0 29 4 2 37635 5 5 -1.0 -1.0 ... 5 (3.0, 6.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (3.0, 35.0] (4.0, 43.0] (-99.001, -1.0]
35562 35563 1 47 1 2 0 6 12 87.0 87.0 ... 1 (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] (-0.001, 1.0] (-0.001, 1.0] (59.0, 448.0]
4883 4884 0 31 1 5 47506 5 12 -1.0 -1.0 ... 2 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (1.0, 2.0] (-0.001, 1.0] (5.0, 8.0] (1.0, 2.0] (1.0, 2.0] (-99.001, -1.0]
71170 71171 0 29 3 4 22037 6 5 -1.0 -1.0 ... 3 (0.999, 3.0] (11484.4, 24221.8] (5.0, 6.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (2.0, 3.0] (2.0, 4.0] (-99.001, -1.0]
25665 25666 0 35 4 3 67400 6 7 -1.0 -1.0 ... 0 (3.0, 6.0] (50000.0, 1420300.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] (-0.001, 1.0] (-0.001, 1.0] (-99.001, -1.0]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6265 6266 0 25 3 3 12000 5 3 -1.0 -1.0 ... 2 (0.999, 3.0] (11484.4, 24221.8] (4.0, 5.0] (3.0, 20.0] (-0.001, 1.0] (4.0, 5.0] (-0.001, 1.0] (1.0, 2.0] (-99.001, -1.0]
54886 54887 0 31 3 4 60300 6 5 69.0 -1.0 ... 1 (0.999, 3.0] (50000.0, 1420300.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (4.0, 5.0] (-0.001, 1.0] (-0.001, 1.0] (59.0, 448.0]
76820 76821 0 28 3 2 45167 5 3 -1.0 -1.0 ... 3 (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (1.0, 2.0] (2.0, 21.0] (8.0, 46.0] (2.0, 3.0] (2.0, 4.0] (-99.001, -1.0]
860 861 1 28 1 5 59111 6 11 -1.0 -1.0 ... 2 (0.999, 3.0] (50000.0, 1420300.0] (5.0, 6.0] (-0.001, 1.0] (1.0, 2.0] (5.0, 8.0] (1.0, 2.0] (1.0, 2.0] (-99.001, -1.0]
15795 15796 0 27 1 4 2878 5 2 -1.0 -1.0 ... 1 (0.999, 3.0] (2500.0, 11484.4] (4.0, 5.0] (-0.001, 1.0] (-0.001, 1.0] (2.0, 4.0] (-0.001, 1.0] (-0.001, 1.0] (-99.001, -1.0]

64000 rows × 26 columns

In [106]:
k = WOE('PA022_bin')
k
Out[106]:
PA022_bin Count Good Bad Good % Bad % PA022_bin_WOE
1 (-99.001, -1.0] 41766 7093 34673 57.49 67.12 -0.15
0 (-1.0, 59.0] 9278 2121 7157 17.19 13.85 0.22
2 (59.0, 448.0] 12644 3045 9599 24.68 18.58 0.28
3 NoData 312 79 233 0.64 0.45 0.35
In [107]:
train_df_WOE_PA022 = pd.merge(train_df, k[['PA022_bin', 'PA022_bin_WOE']],
                             left_on='PA022_bin',
                             right_on='PA022_bin', how='left')
train_df_WOE_PA022.head(10)
Out[107]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... AP003_bin CR009_bin CR015_bin TD001_bin TD006_bin TD009_bin TD010_bin TD014_bin PA022_bin PA022_bin_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 ... (3.0, 6.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (3.0, 35.0] (4.0, 43.0] (-99.001, -1.0] -0.15
1 35563 1 47 1 2 0 6 12 87.0 87.0 ... (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] (-0.001, 1.0] (-0.001, 1.0] (59.0, 448.0] 0.28
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 ... (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (1.0, 2.0] (-0.001, 1.0] (5.0, 8.0] (1.0, 2.0] (1.0, 2.0] (-99.001, -1.0] -0.15
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 ... (0.999, 3.0] (11484.4, 24221.8] (5.0, 6.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (2.0, 3.0] (2.0, 4.0] (-99.001, -1.0] -0.15
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 ... (3.0, 6.0] (50000.0, 1420300.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] (-0.001, 1.0] (-0.001, 1.0] (-99.001, -1.0] -0.15
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 ... (0.999, 3.0] (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (3.0, 35.0] (2.0, 4.0] (-99.001, -1.0] -0.15
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 ... (0.999, 3.0] (-0.001, 2500.0] (5.0, 6.0] (2.0, 3.0] (-0.001, 1.0] (5.0, 8.0] (2.0, 3.0] (2.0, 4.0] (-99.001, -1.0] -0.15
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 ... (0.999, 3.0] (-0.001, 2500.0] (1.999, 4.0] (3.0, 20.0] (-0.001, 1.0] (5.0, 8.0] (-0.001, 1.0] (-0.001, 1.0] (-99.001, -1.0] -0.15
8 56100 1 26 3 5 20799 5 5 12.0 12.0 ... (0.999, 3.0] (11484.4, 24221.8] (4.0, 5.0] (3.0, 20.0] (-0.001, 1.0] (8.0, 46.0] (-0.001, 1.0] (1.0, 2.0] (-1.0, 59.0] 0.22
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 ... (0.999, 3.0] (50000.0, 1420300.0] (4.0, 5.0] (1.0, 2.0] (2.0, 21.0] (4.0, 5.0] (2.0, 3.0] (2.0, 4.0] (59.0, 448.0] 0.28

10 rows × 27 columns

In [108]:
nan_check= train_df_WOE_PA022['PA022_bin_WOE'].isna()
nan_values = train_df_WOE_PA022['PA022_bin_WOE'][nan_check]
nan_values
Out[108]:
Series([], Name: PA022_bin_WOE, dtype: float64)
In [109]:
test_df['PA022'] = pd.to_numeric(test_df['PA022'], errors='coerce')
test_df['PA022_bin'] = pd.qcut(test_df['PA022'],5,duplicates='drop').values.add_categories("NoData")
test_df['PA022_bin'] = test_df['PA022_bin'].fillna("NoData").astype(str)
test_df['PA022_bin'].value_counts(dropna=False)
Out[109]:
(-99.001, -1.0]    10407
(57.0, 434.0]       3147
(-1.0, 57.0]        2377
NoData                69
Name: PA022_bin, dtype: int64
In [110]:
test_df.head()
Out[110]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD001_bin TD006_bin_labels TD006_bin TD009_bin_labels TD009_bin TD010_bin_labels TD010_bin TD014_bin_labels TD014_bin PA022_bin
47044 47045 0 30 3 3 10000 5 5 25.0 25.0 ... (1.0, 2.0] 0 (-0.001, 1.0] 0 (2.0, 4.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] (-1.0, 57.0]
44295 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... (1.0, 2.0] 0 (-0.001, 1.0] 3 (8.0, 46.0] 0 (-0.001, 1.0] 1 (2.0, 4.0] (-99.001, -1.0]
74783 74784 0 29 4 5 33000 5 11 51.0 51.0 ... (-0.001, 1.0] 0 (-0.001, 1.0] 1 (-0.001, 2.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] (-1.0, 57.0]
70975 70976 1 28 1 5 3000 5 3 85.0 85.0 ... (-0.001, 1.0] 2 (2.0, 21.0] 0 (2.0, 4.0] 2 (3.0, 35.0] 2 (1.0, 2.0] (57.0, 434.0]
46645 46646 0 27 1 3 48219 5 11 58.0 58.0 ... (2.0, 3.0] 1 (1.0, 2.0] 4 (4.0, 5.0] 3 (2.0, 3.0] 3 (4.0, 43.0] (57.0, 434.0]

5 rows × 34 columns

In [111]:
# Replace values in a column with another set of values
test_df['PA022_bin'] = test_df['PA022_bin'].replace({"(-1.0, 57.0]": '(-1.0, 59.0]', '(57.0, 434.0]': '(59.0, 448.0]'})
In [112]:
test_df['PA022_bin'].value_counts(dropna=False)
Out[112]:
(-99.001, -1.0]    10407
(59.0, 448.0]       3147
(-1.0, 59.0]        2377
NoData                69
Name: PA022_bin, dtype: int64
In [113]:
test_df_WOE_PA022 = pd.merge(test_df, k[['PA022_bin', 'PA022_bin_WOE']],
                             left_on='PA022_bin',
                             right_on='PA022_bin', how='left')
test_df_WOE_PA022.head(10)
Out[113]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD006_bin_labels TD006_bin TD009_bin_labels TD009_bin TD010_bin_labels TD010_bin TD014_bin_labels TD014_bin PA022_bin PA022_bin_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 ... 0 (-0.001, 1.0] 0 (2.0, 4.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] (-1.0, 59.0] 0.22
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... 0 (-0.001, 1.0] 3 (8.0, 46.0] 0 (-0.001, 1.0] 1 (2.0, 4.0] (-99.001, -1.0] -0.15
2 74784 0 29 4 5 33000 5 11 51.0 51.0 ... 0 (-0.001, 1.0] 1 (-0.001, 2.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] (-1.0, 59.0] 0.22
3 70976 1 28 1 5 3000 5 3 85.0 85.0 ... 2 (2.0, 21.0] 0 (2.0, 4.0] 2 (3.0, 35.0] 2 (1.0, 2.0] (59.0, 448.0] 0.28
4 46646 0 27 1 3 48219 5 11 58.0 58.0 ... 1 (1.0, 2.0] 4 (4.0, 5.0] 3 (2.0, 3.0] 3 (4.0, 43.0] (59.0, 448.0] 0.28
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 ... 0 (-0.001, 1.0] 3 (8.0, 46.0] 2 (3.0, 35.0] 2 (1.0, 2.0] (-99.001, -1.0] -0.15
6 65510 0 23 3 1 8100 2 3 75.0 75.0 ... 2 (2.0, 21.0] 4 (4.0, 5.0] 3 (2.0, 3.0] 3 (4.0, 43.0] (59.0, 448.0] 0.28
7 62716 0 36 1 3 0 5 3 115.0 115.0 ... 0 (-0.001, 1.0] 1 (-0.001, 2.0] 0 (-0.001, 1.0] 1 (2.0, 4.0] (59.0, 448.0] 0.28
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 ... 0 (-0.001, 1.0] 3 (8.0, 46.0] 1 (1.0, 2.0] 3 (4.0, 43.0] (-99.001, -1.0] -0.15
9 58835 0 24 3 2 60877 5 10 52.0 23.0 ... 0 (-0.001, 1.0] 3 (8.0, 46.0] 1 (1.0, 2.0] 2 (1.0, 2.0] (-1.0, 59.0] 0.22

10 rows × 35 columns

In [114]:
nan_check = test_df_WOE_PA022['PA022_bin_WOE'].isna()
nan_values = test_df_WOE_PA022['PA022_bin_WOE'][nan_check]
nan_values
Out[114]:
Series([], Name: PA022_bin_WOE, dtype: float64)

PA023 (DAYS_BTW_APPLICATION_AND_FIRST_COLLECTION_CALL, Call Detail):¶

  • This numeric variable represents the number of days between the loan application and the first collection call.
  • It provides insights into the time lapse between applying for the loan and the initiation of collection calls. A longer duration may indicate delayed collection activities, potentially affecting the loan default risk.
In [115]:
k = WOE('PA022')
k
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
Out[115]:
PA022 Count Good Bad Good % Bad % PA022_WOE
125 123.0 4 0 4 0.00 0.01 -inf
123 121.0 10 1 9 0.01 0.02 -0.69
53 51.0 177 23 154 0.19 0.30 -0.46
0 -99.0 1196 179 1017 1.45 1.97 -0.31
74 72.0 195 29 166 0.24 0.32 -0.29
... ... ... ... ... ... ... ...
155 426.0 1 0 1 0.00 0.00 NaN
157 437.0 1 0 1 0.00 0.00 NaN
158 440.0 1 0 1 0.00 0.00 NaN
159 441.0 1 0 1 0.00 0.00 NaN
161 448.0 1 0 1 0.00 0.00 NaN

163 rows × 7 columns

In [116]:
#Bin the train data
#Convert the 'PA023' column to numeric values, and any non-numeric values (including 'NoData') will be replaced with NaN using the errors='coerce'
train_df['PA023'] = pd.to_numeric(train_df['PA023'], errors='coerce')
train_df['PA023_bin'] = pd.qcut(train_df['PA023'],5,duplicates='drop').values.add_categories("NoData")
train_df['PA023_bin'] = train_df['PA023_bin'].fillna("NoData").astype(str)
train_df['PA023_bin'].value_counts(dropna=False)
Out[116]:
(-99.001, -1.0]    46059
(41.0, 448.0]      12715
(-1.0, 41.0]        4914
NoData               312
Name: PA023_bin, dtype: int64
In [117]:
k = WOE('PA023_bin')
k
Out[117]:
PA023_bin Count Good Bad Good % Bad % PA023_bin_WOE
1 (-99.001, -1.0] 46059 7997 38062 64.82 73.68 -0.13
0 (-1.0, 41.0] 4914 1165 3749 9.44 7.26 0.26
2 (41.0, 448.0] 12715 3097 9618 25.10 18.62 0.30
3 NoData 312 79 233 0.64 0.45 0.35
In [118]:
train_df_WOE_PA023 = pd.merge(train_df, k[['PA023_bin','PA023_bin_WOE']],
                             left_on='PA023_bin',
                             right_on='PA023_bin', how='left')
train_df_WOE_PA023.head(10)
Out[118]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... CR009_bin CR015_bin TD001_bin TD006_bin TD009_bin TD010_bin TD014_bin PA022_bin PA023_bin PA023_bin_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 ... (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (3.0, 35.0] (4.0, 43.0] (-99.001, -1.0] (-99.001, -1.0] -0.13
1 35563 1 47 1 2 0 6 12 87.0 87.0 ... (-0.001, 2500.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] (-0.001, 1.0] (-0.001, 1.0] (59.0, 448.0] (41.0, 448.0] 0.30
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 ... (24221.8, 50000.0] (4.0, 5.0] (1.0, 2.0] (-0.001, 1.0] (5.0, 8.0] (1.0, 2.0] (1.0, 2.0] (-99.001, -1.0] (-99.001, -1.0] -0.13
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 ... (11484.4, 24221.8] (5.0, 6.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (2.0, 3.0] (2.0, 4.0] (-99.001, -1.0] (-99.001, -1.0] -0.13
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 ... (50000.0, 1420300.0] (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] (-0.001, 1.0] (-0.001, 1.0] (-99.001, -1.0] (-99.001, -1.0] -0.13
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 ... (24221.8, 50000.0] (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (3.0, 35.0] (2.0, 4.0] (-99.001, -1.0] (-99.001, -1.0] -0.13
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 ... (-0.001, 2500.0] (5.0, 6.0] (2.0, 3.0] (-0.001, 1.0] (5.0, 8.0] (2.0, 3.0] (2.0, 4.0] (-99.001, -1.0] (-99.001, -1.0] -0.13
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 ... (-0.001, 2500.0] (1.999, 4.0] (3.0, 20.0] (-0.001, 1.0] (5.0, 8.0] (-0.001, 1.0] (-0.001, 1.0] (-99.001, -1.0] (-99.001, -1.0] -0.13
8 56100 1 26 3 5 20799 5 5 12.0 12.0 ... (11484.4, 24221.8] (4.0, 5.0] (3.0, 20.0] (-0.001, 1.0] (8.0, 46.0] (-0.001, 1.0] (1.0, 2.0] (-1.0, 59.0] (-1.0, 41.0] 0.26
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 ... (50000.0, 1420300.0] (4.0, 5.0] (1.0, 2.0] (2.0, 21.0] (4.0, 5.0] (2.0, 3.0] (2.0, 4.0] (59.0, 448.0] (-99.001, -1.0] -0.13

10 rows × 28 columns

In [119]:
nan_check= train_df_WOE_PA023['PA023_bin_WOE'].isna()
nan_values = train_df_WOE_PA023['PA023_bin_WOE'][nan_check]
nan_values
Out[119]:
Series([], Name: PA023_bin_WOE, dtype: float64)
In [120]:
test_df['PA023'] = pd.to_numeric(test_df['PA023'], errors='coerce')
test_df['PA023_bin'] = pd.qcut(test_df['PA023'],5,duplicates='drop').values.add_categories("NoData")
test_df['PA023_bin'] = test_df['PA023_bin'].fillna("NoData").astype(str)
test_df['PA023_bin'].value_counts(dropna=False)
Out[120]:
(-99.001, -1.0]    11479
(39.0, 434.0]       3174
(-1.0, 39.0]        1278
NoData                69
Name: PA023_bin, dtype: int64
In [121]:
# Replace values in a column with another set of values
test_df['PA023_bin'] = test_df['PA023_bin'].replace({"(-1.0, 39.0]": "(-1.0, 41.0]","(39.0, 434.0]": "(41.0, 448.0]"})
test_df['PA023_bin'].value_counts(dropna=False)
Out[121]:
(-99.001, -1.0]    11479
(41.0, 448.0]       3174
(-1.0, 41.0]        1278
NoData                69
Name: PA023_bin, dtype: int64
In [122]:
test_df_WOE_PA023 = pd.merge(test_df, k[['PA023_bin', 'PA023_bin_WOE']],
                             left_on='PA023_bin',
                             right_on='PA023_bin', how='left')
test_df_WOE_PA023.head(10)
Out[122]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD006_bin TD009_bin_labels TD009_bin TD010_bin_labels TD010_bin TD014_bin_labels TD014_bin PA022_bin PA023_bin PA023_bin_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 ... (-0.001, 1.0] 0 (2.0, 4.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] (-1.0, 59.0] (-1.0, 41.0] 0.26
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... (-0.001, 1.0] 3 (8.0, 46.0] 0 (-0.001, 1.0] 1 (2.0, 4.0] (-99.001, -1.0] (-99.001, -1.0] -0.13
2 74784 0 29 4 5 33000 5 11 51.0 51.0 ... (-0.001, 1.0] 1 (-0.001, 2.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] (-1.0, 59.0] (41.0, 448.0] 0.30
3 70976 1 28 1 5 3000 5 3 85.0 85.0 ... (2.0, 21.0] 0 (2.0, 4.0] 2 (3.0, 35.0] 2 (1.0, 2.0] (59.0, 448.0] (41.0, 448.0] 0.30
4 46646 0 27 1 3 48219 5 11 58.0 58.0 ... (1.0, 2.0] 4 (4.0, 5.0] 3 (2.0, 3.0] 3 (4.0, 43.0] (59.0, 448.0] (41.0, 448.0] 0.30
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 ... (-0.001, 1.0] 3 (8.0, 46.0] 2 (3.0, 35.0] 2 (1.0, 2.0] (-99.001, -1.0] (-99.001, -1.0] -0.13
6 65510 0 23 3 1 8100 2 3 75.0 75.0 ... (2.0, 21.0] 4 (4.0, 5.0] 3 (2.0, 3.0] 3 (4.0, 43.0] (59.0, 448.0] (41.0, 448.0] 0.30
7 62716 0 36 1 3 0 5 3 115.0 115.0 ... (-0.001, 1.0] 1 (-0.001, 2.0] 0 (-0.001, 1.0] 1 (2.0, 4.0] (59.0, 448.0] (41.0, 448.0] 0.30
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 ... (-0.001, 1.0] 3 (8.0, 46.0] 1 (1.0, 2.0] 3 (4.0, 43.0] (-99.001, -1.0] (-99.001, -1.0] -0.13
9 58835 0 24 3 2 60877 5 10 52.0 23.0 ... (-0.001, 1.0] 3 (8.0, 46.0] 1 (1.0, 2.0] 2 (1.0, 2.0] (-1.0, 59.0] (-1.0, 41.0] 0.26

10 rows × 36 columns

In [123]:
nan_check = test_df_WOE_PA023['PA023_bin_WOE'].isna()
nan_values = test_df_WOE_PA023['PA023_bin_WOE'][nan_check]
nan_values
Out[123]:
Series([], Name: PA023_bin_WOE, dtype: float64)

PA029 (AVG_LEN_COLLECTION_OR_HIGH_RISK_INBOUND_CALLS, Call Detail): ¶

  • This numeric variable represents the average length of collection or high-risk inbound calls.
  • It provides insights into the average duration of calls related to collections or high-risk situations. Higher values suggest longer conversations, indicating potentially more in-depth discussions related to repayment or risk mitigation strategies.
In [124]:
k = WOE('PA029')
k
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
Out[124]:
PA029 Count Good Bad Good % Bad % PA029_WOE
2516 142.5 6 0 6 0.0 0.01 -inf
778 45.25 3 0 3 0.0 0.01 -inf
1587 77.666667 6 0 6 0.0 0.01 -inf
1614 79.2 3 0 3 0.0 0.01 -inf
2988 221.5 5 0 5 0.0 0.01 -inf
... ... ... ... ... ... ... ...
3583 1462.0 1 0 1 0.0 0.00 NaN
3584 1614.0 1 0 1 0.0 0.00 NaN
3585 1757.0 1 0 1 0.0 0.00 NaN
3586 1919.0 1 0 1 0.0 0.00 NaN
3588 2872.0 1 0 1 0.0 0.00 NaN

3590 rows × 7 columns

In [125]:
#Bin the train data
#Convert the 'PA029' column to numeric values, and any non-numeric values (including 'NoData') will be replaced with NaN using the errors='coerce'
train_df['PA029'] = pd.to_numeric(train_df['PA029'], errors='coerce')
train_df['PA029_bin'] = pd.qcut(train_df['PA029'],5,duplicates='drop').values.add_categories("NoData")
train_df['PA029_bin'] = train_df['PA029_bin'].fillna("NoData").astype(str)
train_df['PA029_bin'].value_counts(dropna=False)
Out[125]:
(-99.001, -98.0]    43718
(40.0, 2872.0]      12674
(-98.0, 40.0]        7296
NoData                312
Name: PA029_bin, dtype: int64
In [126]:
k = WOE('PA029_bin')
k
Out[126]:
PA029_bin Count Good Bad Good % Bad % PA029_bin_WOE
1 (-99.001, -98.0] 43718 7545 36173 61.15 70.02 -0.14
0 (-98.0, 40.0] 7296 1493 5803 12.10 11.23 0.07
3 NoData 312 79 233 0.64 0.45 0.35
2 (40.0, 2872.0] 12674 3221 9453 26.11 18.30 0.36
In [127]:
train_df_WOE_PA029 = pd.merge(train_df, k[['PA029_bin','PA029_bin_WOE']],
                             left_on='PA029_bin',
                             right_on='PA029_bin', how='left')
train_df_WOE_PA029.head(10)
Out[127]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... CR015_bin TD001_bin TD006_bin TD009_bin TD010_bin TD014_bin PA022_bin PA023_bin PA029_bin PA029_bin_WOE
0 3823 0 29 4 2 37635 5 5 -1.0 -1.0 ... (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (3.0, 35.0] (4.0, 43.0] (-99.001, -1.0] (-99.001, -1.0] (-99.001, -98.0] -0.14
1 35563 1 47 1 2 0 6 12 87.0 87.0 ... (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] (-0.001, 1.0] (-0.001, 1.0] (59.0, 448.0] (41.0, 448.0] (-98.0, 40.0] 0.07
2 4884 0 31 1 5 47506 5 12 -1.0 -1.0 ... (4.0, 5.0] (1.0, 2.0] (-0.001, 1.0] (5.0, 8.0] (1.0, 2.0] (1.0, 2.0] (-99.001, -1.0] (-99.001, -1.0] (-99.001, -98.0] -0.14
3 71171 0 29 3 4 22037 6 5 -1.0 -1.0 ... (5.0, 6.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (2.0, 3.0] (2.0, 4.0] (-99.001, -1.0] (-99.001, -1.0] (-99.001, -98.0] -0.14
4 25666 0 35 4 3 67400 6 7 -1.0 -1.0 ... (5.0, 6.0] (1.0, 2.0] (-0.001, 1.0] (-0.001, 2.0] (-0.001, 1.0] (-0.001, 1.0] (-99.001, -1.0] (-99.001, -1.0] (-99.001, -98.0] -0.14
5 8007 0 30 3 2 26917 5 4 -1.0 -1.0 ... (4.0, 5.0] (3.0, 20.0] (2.0, 21.0] (8.0, 46.0] (3.0, 35.0] (2.0, 4.0] (-99.001, -1.0] (-99.001, -1.0] (-99.001, -98.0] -0.14
6 62227 0 35 1 5 0 6 3 -1.0 -1.0 ... (5.0, 6.0] (2.0, 3.0] (-0.001, 1.0] (5.0, 8.0] (2.0, 3.0] (2.0, 4.0] (-99.001, -1.0] (-99.001, -1.0] (-99.001, -98.0] -0.14
7 12634 0 25 1 5 0 3 5 -1.0 -1.0 ... (1.999, 4.0] (3.0, 20.0] (-0.001, 1.0] (5.0, 8.0] (-0.001, 1.0] (-0.001, 1.0] (-99.001, -1.0] (-99.001, -1.0] (-99.001, -98.0] -0.14
8 56100 1 26 3 5 20799 5 5 12.0 12.0 ... (4.0, 5.0] (3.0, 20.0] (-0.001, 1.0] (8.0, 46.0] (-0.001, 1.0] (1.0, 2.0] (-1.0, 59.0] (-1.0, 41.0] (40.0, 2872.0] 0.36
9 33174 0 37 1 3 55000 5 7 69.0 -1.0 ... (4.0, 5.0] (1.0, 2.0] (2.0, 21.0] (4.0, 5.0] (2.0, 3.0] (2.0, 4.0] (59.0, 448.0] (-99.001, -1.0] (40.0, 2872.0] 0.36

10 rows × 29 columns

In [128]:
nan_check= train_df_WOE_PA029['PA029_bin_WOE'].isna()
nan_values = train_df_WOE_PA029['PA029_bin_WOE'][nan_check]
nan_values
Out[128]:
Series([], Name: PA029_bin_WOE, dtype: float64)
In [129]:
test_df['PA029'] = pd.to_numeric(test_df['PA029'], errors='coerce')
test_df['PA029_bin'] = pd.qcut(test_df['PA029'],5,duplicates='drop').values.add_categories("NoData")
test_df['PA029_bin'] = test_df['PA029_bin'].fillna("NoData").astype(str)
test_df['PA029_bin'].value_counts(dropna=False)
Out[129]:
(-99.001, -98.0]    10902
(40.2, 1767.75]      3186
(-98.0, 40.2]        1843
NoData                 69
Name: PA029_bin, dtype: int64
In [130]:
# Replace values in a column with another set of values
test_df['PA029_bin'] = test_df['PA029_bin'].replace({"(40.2, 1767.75]": "(40.0, 2872.0]","(-98.0, 40.2]": "(-98.0, 40.0]"})
test_df['PA029_bin'].value_counts(dropna=False)
Out[130]:
(-99.001, -98.0]    10902
(40.0, 2872.0]       3186
(-98.0, 40.0]        1843
NoData                 69
Name: PA029_bin, dtype: int64
In [131]:
test_df_WOE_PA029 = pd.merge(test_df, k[['PA029_bin', 'PA029_bin_WOE']],
                             left_on='PA029_bin',
                             right_on='PA029_bin', how='left')
test_df_WOE_PA029.head(10)
Out[131]:
id loan_default AP001 AP003 AP008 CR009 CR015 CR019 PA022 PA023 ... TD009_bin_labels TD009_bin TD010_bin_labels TD010_bin TD014_bin_labels TD014_bin PA022_bin PA023_bin PA029_bin PA029_bin_WOE
0 47045 0 30 3 3 10000 5 5 25.0 25.0 ... 0 (2.0, 4.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] (-1.0, 59.0] (-1.0, 41.0] (-99.001, -98.0] -0.14
1 44296 0 33 3 5 27288 5 5 -1.0 -1.0 ... 3 (8.0, 46.0] 0 (-0.001, 1.0] 1 (2.0, 4.0] (-99.001, -1.0] (-99.001, -1.0] (-99.001, -98.0] -0.14
2 74784 0 29 4 5 33000 5 11 51.0 51.0 ... 1 (-0.001, 2.0] 0 (-0.001, 1.0] 0 (-0.001, 1.0] (-1.0, 59.0] (41.0, 448.0] (-98.0, 40.0] 0.07
3 70976 1 28 1 5 3000 5 3 85.0 85.0 ... 0 (2.0, 4.0] 2 (3.0, 35.0] 2 (1.0, 2.0] (59.0, 448.0] (41.0, 448.0] (40.0, 2872.0] 0.36
4 46646 0 27 1 3 48219 5 11 58.0 58.0 ... 4 (4.0, 5.0] 3 (2.0, 3.0] 3 (4.0, 43.0] (59.0, 448.0] (41.0, 448.0] (40.0, 2872.0] 0.36
5 8216 0 33 4 1 5000 6 11 -1.0 -1.0 ... 3 (8.0, 46.0] 2 (3.0, 35.0] 2 (1.0, 2.0] (-99.001, -1.0] (-99.001, -1.0] (-99.001, -98.0] -0.14
6 65510 0 23 3 1 8100 2 3 75.0 75.0 ... 4 (4.0, 5.0] 3 (2.0, 3.0] 3 (4.0, 43.0] (59.0, 448.0] (41.0, 448.0] (40.0, 2872.0] 0.36
7 62716 0 36 1 3 0 5 3 115.0 115.0 ... 1 (-0.001, 2.0] 0 (-0.001, 1.0] 1 (2.0, 4.0] (59.0, 448.0] (41.0, 448.0] (-98.0, 40.0] 0.07
8 39860 0 21 3 3 17110 5 8 -1.0 -1.0 ... 3 (8.0, 46.0] 1 (1.0, 2.0] 3 (4.0, 43.0] (-99.001, -1.0] (-99.001, -1.0] (-99.001, -98.0] -0.14
9 58835 0 24 3 2 60877 5 10 52.0 23.0 ... 3 (8.0, 46.0] 1 (1.0, 2.0] 2 (1.0, 2.0] (-1.0, 59.0] (-1.0, 41.0] (40.0, 2872.0] 0.36

10 rows × 37 columns

In [132]:
nan_check = test_df_WOE_PA029['PA029_bin_WOE'].isna()
nan_values = test_df_WOE_PA029['PA029_bin_WOE'][nan_check]
nan_values
Out[132]:
Series([], Name: PA029_bin_WOE, dtype: float64)

Merge all data by id¶

In [133]:
column_names = train_df.columns.tolist()
print(column_names)
['id', 'loan_default', 'AP001', 'AP003', 'AP008', 'CR009', 'CR015', 'CR019', 'PA022', 'PA023', 'PA029', 'TD001', 'TD005', 'TD006', 'TD009', 'TD010', 'TD014', 'AP003_bin', 'CR009_bin', 'CR015_bin', 'TD001_bin', 'TD006_bin', 'TD009_bin', 'TD010_bin', 'TD014_bin', 'PA022_bin', 'PA023_bin', 'PA029_bin']
In [134]:
column_names1 = test_df.columns.tolist()
print(column_names1)
['id', 'loan_default', 'AP001', 'AP003', 'AP008', 'CR009', 'CR015', 'CR019', 'PA022', 'PA023', 'PA029', 'TD001', 'TD005', 'TD006', 'TD009', 'TD010', 'TD014', 'AP003_bin_labels', 'AP003_bin', 'CR009_bin_labels', 'CR009_bin', 'CR015_bin_labels', 'CR015_bin', 'TD001_bin_labels', 'TD001_bin', 'TD006_bin_labels', 'TD006_bin', 'TD009_bin_labels', 'TD009_bin', 'TD010_bin_labels', 'TD010_bin', 'TD014_bin_labels', 'TD014_bin', 'PA022_bin', 'PA023_bin', 'PA029_bin']
In [135]:
#train_df.target = train_df['id', 'loan_default']
train_df.target = train_df.drop(columns=train_df.columns.difference(['loan_default']))
test_df.target = test_df.drop(columns=test_df.columns.difference(['loan_default']))
#test_df.target = test_df['id','loan_default']
/var/folders/jl/pdyb2sq53l1_msbfhzzlrt6m0000gn/T/ipykernel_11511/245799215.py:2: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  train_df.target = train_df.drop(columns=train_df.columns.difference(['loan_default']))
/var/folders/jl/pdyb2sq53l1_msbfhzzlrt6m0000gn/T/ipykernel_11511/245799215.py:3: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  test_df.target = test_df.drop(columns=test_df.columns.difference(['loan_default']))
In [136]:
test_df.target
Out[136]:
loan_default
47044 0
44295 0
74783 0
70975 1
46645 0
... ...
67666 0
51146 0
42494 1
52517 0
7754 0

16000 rows × 1 columns

In [137]:
train_df.target
Out[137]:
loan_default
3822 0
35562 1
4883 0
71170 0
25665 0
... ...
6265 0
54886 0
76820 0
860 1
15795 0

64000 rows × 1 columns

In [138]:
#train_df.drop('loan_default', 'AP001', 'AP003', 'AP008', 'CR009', 'CR015', 'CR019', 'PA022', 'PA023', 'PA029', 'TD001', 'TD005', 'TD006', 'TD009', 'TD010', 'TD014', 'PA022_bin', 'PA023_bin', 'PA029_bin', 'TD001_bin', 'TD006_bin', 'TD009_bin', 'TD010_bin', 'TD014_bin', 'AP003_bin', 'CR009_bin', 'CR015_bin', axis=1, inplace=True)

train_df_WOE= train_df.drop(columns=train_df.columns.difference(['id']))
In [139]:
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_AP001[['id',"AP001_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_AP003[['id',"AP003_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_AP008[['id',"AP008_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_CR009[['id',"CR009_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_CR015[['id',"CR015_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_CR019[['id',"CR019_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_PA022[['id',"PA022_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_PA023[['id',"PA023_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_PA029[['id',"PA029_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_TD001[['id',"TD001_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_TD005[['id',"TD005_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_TD006[['id',"TD006_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_TD009[['id',"TD009_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_TD010[['id',"TD010_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_WOE= pd.merge(train_df_WOE, train_df_WOE_TD014[['id',"TD014_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
In [140]:
train_df_WOE
Out[140]:
id AP001_WOE AP003_bin_WOE AP008_WOE CR009_bin_WOE CR015_bin_WOE CR019_WOE PA022_bin_WOE PA023_bin_WOE PA029_bin_WOE TD001_bin_WOE TD005_WOE TD006_bin_WOE TD009_bin_WOE TD010_bin_WOE TD014_bin_WOE
0 3823 -0.03 -0.50 -0.09 0.07 0.08 0.02 -0.15 -0.13 -0.14 0.39 0.41 0.40 0.49 0.45 0.48
1 35563 -0.04 0.07 -0.09 -0.09 -0.27 -0.22 0.28 0.30 0.07 0.02 -0.22 -0.14 -0.49 -0.24 -0.30
2 4884 0.01 0.07 0.11 0.07 0.08 -0.22 -0.15 -0.13 -0.14 0.02 -0.03 -0.14 0.17 0.00 -0.08
3 71171 -0.03 0.07 0.09 0.07 -0.27 0.02 -0.15 -0.13 -0.14 0.39 0.59 0.40 0.49 0.16 0.14
4 25666 -0.09 -0.50 0.02 -0.14 -0.27 -0.01 -0.15 -0.13 -0.14 0.02 -0.22 -0.14 -0.49 -0.24 -0.30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
63995 6266 0.04 0.07 0.02 0.07 0.08 0.12 -0.15 -0.13 -0.14 0.39 0.04 -0.14 0.04 -0.24 -0.08
63996 54887 0.01 0.07 0.09 -0.14 -0.27 0.02 0.28 -0.13 0.07 0.02 0.04 -0.14 0.04 -0.24 -0.30
63997 76821 0.04 0.07 -0.09 0.07 0.08 0.12 -0.15 -0.13 -0.14 0.02 0.69 0.40 0.49 0.16 0.14
63998 861 0.04 0.07 0.11 -0.14 -0.27 -0.20 -0.15 -0.13 -0.14 -0.24 -0.22 0.11 0.17 0.00 -0.08
63999 15796 0.10 0.07 0.09 0.08 0.08 0.14 -0.15 -0.13 -0.14 -0.24 -0.51 -0.14 -0.18 -0.24 -0.30

64000 rows × 16 columns

In [141]:
test_df_WOE= test_df.drop(columns=test_df.columns.difference(['id']))
In [142]:
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_AP001[['id', 'AP001_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_AP003[['id', 'AP003_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_AP008[['id', 'AP008_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_CR009[['id', 'CR009_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_CR015[['id', 'CR015_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_CR019[['id', 'CR019_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_PA022[['id', 'PA022_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_PA023[['id', 'PA023_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_PA029[['id', 'PA029_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_TD001[['id', 'TD001_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_TD005[['id', 'TD005_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_TD006[['id', 'TD006_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_TD009[['id', 'TD009_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_TD010[['id', 'TD010_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_TD014[['id', 'TD014_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
In [143]:
test_df_WOE
Out[143]:
id AP001_WOE AP003_bin_WOE AP008_WOE CR009_bin_WOE CR015_bin_WOE CR019_WOE PA022_bin_WOE PA023_bin_WOE PA029_bin_WOE TD001_bin_WOE TD005_WOE TD006_bin_WOE TD009_bin_WOE TD010_bin_WOE TD014_bin_WOE
0 47045 0.04 0.07 0.02 -0.09 -0.27 0.02 0.22 0.26 -0.14 0.02 -0.22 -0.14 -0.18 -0.24 -0.30
1 44296 -0.04 0.07 0.11 -0.14 -0.27 0.02 -0.15 -0.13 -0.14 0.02 0.04 -0.14 0.49 -0.24 0.14
2 74784 -0.03 -0.50 0.11 -0.14 -0.27 -0.20 0.22 0.30 0.07 -0.24 -0.03 -0.14 -0.49 -0.24 -0.30
3 70976 0.04 0.07 0.11 -0.09 -0.27 0.12 0.28 0.30 0.36 -0.24 -0.51 0.40 -0.18 0.45 -0.08
4 46646 0.10 0.07 0.02 -0.14 -0.27 -0.20 0.28 0.30 0.36 0.12 0.39 0.11 0.04 0.16 0.48
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15995 67667 -0.08 0.07 0.11 -0.14 0.19 -0.20 0.22 0.30 0.07 0.02 -0.03 0.11 -0.49 0.16 -0.08
15996 51147 -0.10 0.07 -0.09 -0.14 0.19 0.14 0.28 0.30 0.36 0.12 0.58 0.40 0.04 0.16 0.48
15997 42495 0.01 0.07 -0.09 0.07 -0.27 0.12 -0.15 -0.13 -0.14 0.39 -0.03 -0.14 -0.49 -0.24 0.14
15998 52518 -0.03 0.07 -0.20 -0.09 0.08 0.14 -0.15 -0.13 -0.14 0.39 -0.03 -0.14 -0.49 -0.24 0.14
15999 7755 -0.07 0.07 -0.09 0.08 0.19 -0.06 -0.15 -0.13 -0.14 0.02 0.23 -0.14 0.04 0.45 0.48

16000 rows × 16 columns

In [144]:
column_names = train_df_WOE.columns.tolist()
print(column_names)
['id', 'AP001_WOE', 'AP003_bin_WOE', 'AP008_WOE', 'CR009_bin_WOE', 'CR015_bin_WOE', 'CR019_WOE', 'PA022_bin_WOE', 'PA023_bin_WOE', 'PA029_bin_WOE', 'TD001_bin_WOE', 'TD005_WOE', 'TD006_bin_WOE', 'TD009_bin_WOE', 'TD010_bin_WOE', 'TD014_bin_WOE']
In [145]:
column_names = train_df_WOE.columns.tolist()
print(column_names)
['id', 'AP001_WOE', 'AP003_bin_WOE', 'AP008_WOE', 'CR009_bin_WOE', 'CR015_bin_WOE', 'CR019_WOE', 'PA022_bin_WOE', 'PA023_bin_WOE', 'PA029_bin_WOE', 'TD001_bin_WOE', 'TD005_WOE', 'TD006_bin_WOE', 'TD009_bin_WOE', 'TD010_bin_WOE', 'TD014_bin_WOE']
In [146]:
test_df_WOE_withoutid = test_df_WOE.drop("id", axis=1)
test_df_WOE_withoutid
Out[146]:
AP001_WOE AP003_bin_WOE AP008_WOE CR009_bin_WOE CR015_bin_WOE CR019_WOE PA022_bin_WOE PA023_bin_WOE PA029_bin_WOE TD001_bin_WOE TD005_WOE TD006_bin_WOE TD009_bin_WOE TD010_bin_WOE TD014_bin_WOE
0 0.04 0.07 0.02 -0.09 -0.27 0.02 0.22 0.26 -0.14 0.02 -0.22 -0.14 -0.18 -0.24 -0.30
1 -0.04 0.07 0.11 -0.14 -0.27 0.02 -0.15 -0.13 -0.14 0.02 0.04 -0.14 0.49 -0.24 0.14
2 -0.03 -0.50 0.11 -0.14 -0.27 -0.20 0.22 0.30 0.07 -0.24 -0.03 -0.14 -0.49 -0.24 -0.30
3 0.04 0.07 0.11 -0.09 -0.27 0.12 0.28 0.30 0.36 -0.24 -0.51 0.40 -0.18 0.45 -0.08
4 0.10 0.07 0.02 -0.14 -0.27 -0.20 0.28 0.30 0.36 0.12 0.39 0.11 0.04 0.16 0.48
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15995 -0.08 0.07 0.11 -0.14 0.19 -0.20 0.22 0.30 0.07 0.02 -0.03 0.11 -0.49 0.16 -0.08
15996 -0.10 0.07 -0.09 -0.14 0.19 0.14 0.28 0.30 0.36 0.12 0.58 0.40 0.04 0.16 0.48
15997 0.01 0.07 -0.09 0.07 -0.27 0.12 -0.15 -0.13 -0.14 0.39 -0.03 -0.14 -0.49 -0.24 0.14
15998 -0.03 0.07 -0.20 -0.09 0.08 0.14 -0.15 -0.13 -0.14 0.39 -0.03 -0.14 -0.49 -0.24 0.14
15999 -0.07 0.07 -0.09 0.08 0.19 -0.06 -0.15 -0.13 -0.14 0.02 0.23 -0.14 0.04 0.45 0.48

16000 rows × 15 columns

In [147]:
train_df_WOE_withoutid = train_df_WOE.drop("id", axis=1)
train_df_WOE_withoutid
Out[147]:
AP001_WOE AP003_bin_WOE AP008_WOE CR009_bin_WOE CR015_bin_WOE CR019_WOE PA022_bin_WOE PA023_bin_WOE PA029_bin_WOE TD001_bin_WOE TD005_WOE TD006_bin_WOE TD009_bin_WOE TD010_bin_WOE TD014_bin_WOE
0 -0.03 -0.50 -0.09 0.07 0.08 0.02 -0.15 -0.13 -0.14 0.39 0.41 0.40 0.49 0.45 0.48
1 -0.04 0.07 -0.09 -0.09 -0.27 -0.22 0.28 0.30 0.07 0.02 -0.22 -0.14 -0.49 -0.24 -0.30
2 0.01 0.07 0.11 0.07 0.08 -0.22 -0.15 -0.13 -0.14 0.02 -0.03 -0.14 0.17 0.00 -0.08
3 -0.03 0.07 0.09 0.07 -0.27 0.02 -0.15 -0.13 -0.14 0.39 0.59 0.40 0.49 0.16 0.14
4 -0.09 -0.50 0.02 -0.14 -0.27 -0.01 -0.15 -0.13 -0.14 0.02 -0.22 -0.14 -0.49 -0.24 -0.30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
63995 0.04 0.07 0.02 0.07 0.08 0.12 -0.15 -0.13 -0.14 0.39 0.04 -0.14 0.04 -0.24 -0.08
63996 0.01 0.07 0.09 -0.14 -0.27 0.02 0.28 -0.13 0.07 0.02 0.04 -0.14 0.04 -0.24 -0.30
63997 0.04 0.07 -0.09 0.07 0.08 0.12 -0.15 -0.13 -0.14 0.02 0.69 0.40 0.49 0.16 0.14
63998 0.04 0.07 0.11 -0.14 -0.27 -0.20 -0.15 -0.13 -0.14 -0.24 -0.22 0.11 0.17 0.00 -0.08
63999 0.10 0.07 0.09 0.08 0.08 0.14 -0.15 -0.13 -0.14 -0.24 -0.51 -0.14 -0.18 -0.24 -0.30

64000 rows × 15 columns

Section 3 Randon Forest¶

Process data for modeling¶

In [148]:
train_df_rf= train_df.drop(columns=train_df.columns.difference(['id', 'loan_default']))
In [149]:
train_df_rf= pd.merge(train_df_rf, train_df_WOE_AP001[['id',"AP001_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_AP003[['id',"AP003_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_AP008[['id',"AP008_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_CR009[['id',"CR009_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_CR015[['id',"CR015_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_CR019[['id',"CR019_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_PA022[['id',"PA022_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_PA023[['id',"PA023_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_PA029[['id',"PA029_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_TD001[['id',"TD001_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf =pd.merge(train_df_rf, train_df_WOE_TD005[['id',"TD005_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_TD006[['id',"TD006_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_TD009[['id',"TD009_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_TD010[['id',"TD010_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
train_df_rf= pd.merge(train_df_rf, train_df_WOE_TD014[['id',"TD014_bin_WOE"]],
                             left_on='id',
                             right_on='id', how='left')
In [150]:
train_df_rf
Out[150]:
id loan_default AP001_WOE AP003_bin_WOE AP008_WOE CR009_bin_WOE CR015_bin_WOE CR019_WOE PA022_bin_WOE PA023_bin_WOE PA029_bin_WOE TD001_bin_WOE TD005_WOE TD006_bin_WOE TD009_bin_WOE TD010_bin_WOE TD014_bin_WOE
0 3823 0 -0.03 -0.50 -0.09 0.07 0.08 0.02 -0.15 -0.13 -0.14 0.39 0.41 0.40 0.49 0.45 0.48
1 35563 1 -0.04 0.07 -0.09 -0.09 -0.27 -0.22 0.28 0.30 0.07 0.02 -0.22 -0.14 -0.49 -0.24 -0.30
2 4884 0 0.01 0.07 0.11 0.07 0.08 -0.22 -0.15 -0.13 -0.14 0.02 -0.03 -0.14 0.17 0.00 -0.08
3 71171 0 -0.03 0.07 0.09 0.07 -0.27 0.02 -0.15 -0.13 -0.14 0.39 0.59 0.40 0.49 0.16 0.14
4 25666 0 -0.09 -0.50 0.02 -0.14 -0.27 -0.01 -0.15 -0.13 -0.14 0.02 -0.22 -0.14 -0.49 -0.24 -0.30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
63995 6266 0 0.04 0.07 0.02 0.07 0.08 0.12 -0.15 -0.13 -0.14 0.39 0.04 -0.14 0.04 -0.24 -0.08
63996 54887 0 0.01 0.07 0.09 -0.14 -0.27 0.02 0.28 -0.13 0.07 0.02 0.04 -0.14 0.04 -0.24 -0.30
63997 76821 0 0.04 0.07 -0.09 0.07 0.08 0.12 -0.15 -0.13 -0.14 0.02 0.69 0.40 0.49 0.16 0.14
63998 861 1 0.04 0.07 0.11 -0.14 -0.27 -0.20 -0.15 -0.13 -0.14 -0.24 -0.22 0.11 0.17 0.00 -0.08
63999 15796 0 0.10 0.07 0.09 0.08 0.08 0.14 -0.15 -0.13 -0.14 -0.24 -0.51 -0.14 -0.18 -0.24 -0.30

64000 rows × 17 columns

In [151]:
test_df_rf= test_df.drop(columns=test_df.columns.difference(['id','loan_default']))
In [152]:
test_df_rf = pd.merge(test_df_rf, test_df_WOE_AP001[['id', 'AP001_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_AP003[['id', 'AP003_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_AP008[['id', 'AP008_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_CR009[['id', 'CR009_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_CR015[['id', 'CR015_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_CR019[['id', 'CR019_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_PA022[['id', 'PA022_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_PA023[['id', 'PA023_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_PA029[['id', 'PA029_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_TD001[['id', 'TD001_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_TD005[['id', 'TD005_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_TD006[['id', 'TD006_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_TD009[['id', 'TD009_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_TD010[['id', 'TD010_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_TD014[['id', 'TD014_bin_WOE']],
                      left_on='id',
                      right_on='id', how='left')
In [153]:
test_df_rf
Out[153]:
id loan_default AP001_WOE AP003_bin_WOE AP008_WOE CR009_bin_WOE CR015_bin_WOE CR019_WOE PA022_bin_WOE PA023_bin_WOE PA029_bin_WOE TD001_bin_WOE TD005_WOE TD006_bin_WOE TD009_bin_WOE TD010_bin_WOE TD014_bin_WOE
0 47045 0 0.04 0.07 0.02 -0.09 -0.27 0.02 0.22 0.26 -0.14 0.02 -0.22 -0.14 -0.18 -0.24 -0.30
1 44296 0 -0.04 0.07 0.11 -0.14 -0.27 0.02 -0.15 -0.13 -0.14 0.02 0.04 -0.14 0.49 -0.24 0.14
2 74784 0 -0.03 -0.50 0.11 -0.14 -0.27 -0.20 0.22 0.30 0.07 -0.24 -0.03 -0.14 -0.49 -0.24 -0.30
3 70976 1 0.04 0.07 0.11 -0.09 -0.27 0.12 0.28 0.30 0.36 -0.24 -0.51 0.40 -0.18 0.45 -0.08
4 46646 0 0.10 0.07 0.02 -0.14 -0.27 -0.20 0.28 0.30 0.36 0.12 0.39 0.11 0.04 0.16 0.48
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15995 67667 0 -0.08 0.07 0.11 -0.14 0.19 -0.20 0.22 0.30 0.07 0.02 -0.03 0.11 -0.49 0.16 -0.08
15996 51147 0 -0.10 0.07 -0.09 -0.14 0.19 0.14 0.28 0.30 0.36 0.12 0.58 0.40 0.04 0.16 0.48
15997 42495 1 0.01 0.07 -0.09 0.07 -0.27 0.12 -0.15 -0.13 -0.14 0.39 -0.03 -0.14 -0.49 -0.24 0.14
15998 52518 0 -0.03 0.07 -0.20 -0.09 0.08 0.14 -0.15 -0.13 -0.14 0.39 -0.03 -0.14 -0.49 -0.24 0.14
15999 7755 0 -0.07 0.07 -0.09 0.08 0.19 -0.06 -0.15 -0.13 -0.14 0.02 0.23 -0.14 0.04 0.45 0.48

16000 rows × 17 columns

In [154]:
import numpy as np
import datetime
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
In [155]:
from matplotlib.rcsetup import validate_aspect
#Use WOE transformed features to run model
#train_df_WOE & test_df_WOE
train_df_rf.shape
Out[155]:
(64000, 17)
In [156]:
test_df_rf.shape
Out[156]:
(16000, 17)
In [157]:
var = pd.DataFrame(train_df_rf.dtypes)
var
Out[157]:
0
id int64
loan_default int64
AP001_WOE float64
AP003_bin_WOE float64
AP008_WOE float64
CR009_bin_WOE float64
CR015_bin_WOE float64
CR019_WOE float64
PA022_bin_WOE float64
PA023_bin_WOE float64
PA029_bin_WOE float64
TD001_bin_WOE float64
TD005_WOE float64
TD006_bin_WOE float64
TD009_bin_WOE float64
TD010_bin_WOE float64
TD014_bin_WOE float64
In [158]:
pip install h2o
Requirement already satisfied: h2o in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (3.42.0.1)
Requirement already satisfied: requests in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from h2o) (2.28.2)
Requirement already satisfied: tabulate in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from h2o) (0.9.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from requests->h2o) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from requests->h2o) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from requests->h2o) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from requests->h2o) (2022.12.7)

[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
In [159]:
import h2o
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321. connected.
H2O_cluster_uptime: 8 hours 9 mins
H2O_cluster_timezone: Asia/Taipei
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.42.0.1
H2O_cluster_version_age: 1 month and 27 days
H2O_cluster_name: H2O_from_python_yientseng_hm4qux
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 1.469 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.11.1 final
In [160]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
In [161]:
target='loan_default'

Modeling best practices¶

  • When you model, you should run with a small sample dataset
  • Try to write repeating code in a function
In [ ]:
train_smpl = train_df_rf.sample(frac=0.1, random_state=1)
test_smpl = test_df_rf.sample(frac=0.1, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
In [ ]:
predictors = train_df_rf.columns.tolist()
predictors=predictors[2:17]
predictors
Out[ ]:
['AP001_WOE',
 'AP003_bin_WOE',
 'AP008_WOE',
 'CR009_bin_WOE',
 'CR015_bin_WOE',
 'CR019_WOE',
 'PA022_bin_WOE',
 'PA023_bin_WOE',
 'PA029_bin_WOE',
 'TD001_bin_WOE',
 'TD005_WOE',
 'TD006_bin_WOE',
 'TD009_bin_WOE',
 'TD010_bin_WOE',
 'TD014_bin_WOE']
In [ ]:
rf_v1 = H2ORandomForestEstimator(
        model_id = 'rf_v1',
        ntrees = 300,
        nfolds=10,
        min_rows=100,
        seed=1234)
rf_v1.train(predictors,target,training_frame=train_hex)
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Out[ ]:
Model Details
=============
H2ORandomForestEstimator : Distributed Random Forest
Model Key: rf_v1
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
300.0 300.0 125312.0 7.0 12.0 8.963333 24.0 32.0 28.423334
ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.14951065232172528
RMSE: 0.3866660734040747
MAE: 0.2992474803933356
RMSLE: 0.2712098866294993
Mean Residual Deviance: 0.14951065232172528
ModelMetricsRegression: drf
** Reported on cross-validation data. **

MSE: 0.14963480040402838
RMSE: 0.3868265766516416
MAE: 0.299352536320045
RMSLE: 0.27131915462305656
Mean Residual Deviance: 0.14963480040402838
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid cv_8_valid cv_9_valid cv_10_valid
mae 0.2993906 0.0082924 0.3132309 0.2944414 0.3019102 0.2925448 0.3000576 0.3052748 0.2901710 0.3090672 0.2877329 0.2994755
mean_residual_deviance 0.1496330 0.0094428 0.1652384 0.1433392 0.1545313 0.1437271 0.1504920 0.1577827 0.1377648 0.1591601 0.1371658 0.1471285
mse 0.1496330 0.0094428 0.1652384 0.1433392 0.1545313 0.1437271 0.1504920 0.1577827 0.1377648 0.1591601 0.1371658 0.1471285
r2 0.0231276 0.0151769 0.0032182 0.0241438 0.0142291 0.0122233 0.0187475 0.0222429 0.0333000 0.0565313 0.0124066 0.0342336
residual_deviance 0.1496330 0.0094428 0.1652384 0.1433392 0.1545313 0.1437271 0.1504920 0.1577827 0.1377648 0.1591601 0.1371658 0.1471285
rmse 0.3866515 0.0121850 0.4064952 0.3786016 0.3931047 0.3791136 0.3879329 0.3972187 0.3711669 0.3989487 0.3703589 0.3835733
rmsle 0.2712443 0.0065110 0.2830140 0.2670296 0.2748869 0.2674065 0.2722500 0.2767344 0.2627884 0.2759479 0.2629195 0.2694657
Scoring History:
timestamp duration number_of_trees training_rmse training_mae training_deviance
2023-07-26 23:24:29 1 min 3.505 sec 0.0 nan nan nan
2023-07-26 23:24:29 1 min 3.573 sec 1.0 0.3875406 0.2971530 0.1501877
2023-07-26 23:24:30 1 min 3.615 sec 2.0 0.3890889 0.3008566 0.1513902
2023-07-26 23:24:30 1 min 3.630 sec 3.0 0.3882861 0.2995615 0.1507661
2023-07-26 23:24:30 1 min 3.645 sec 4.0 0.3869131 0.2988633 0.1497017
2023-07-26 23:24:30 1 min 3.660 sec 5.0 0.3873920 0.3000798 0.1500726
2023-07-26 23:24:30 1 min 3.674 sec 6.0 0.3873828 0.2999194 0.1500654
2023-07-26 23:24:30 1 min 3.690 sec 7.0 0.3879887 0.3004147 0.1505352
2023-07-26 23:24:30 1 min 3.706 sec 8.0 0.3881241 0.3000204 0.1506403
2023-07-26 23:24:30 1 min 3.729 sec 9.0 0.3876442 0.3000743 0.1502680
--- --- --- --- --- --- ---
2023-07-26 23:24:33 1 min 7.048 sec 291.0 0.3866741 0.2992291 0.1495169
2023-07-26 23:24:33 1 min 7.060 sec 292.0 0.3866675 0.2992232 0.1495117
2023-07-26 23:24:33 1 min 7.069 sec 293.0 0.3866678 0.2992219 0.1495120
2023-07-26 23:24:33 1 min 7.079 sec 294.0 0.3866633 0.2992180 0.1495085
2023-07-26 23:24:33 1 min 7.092 sec 295.0 0.3866660 0.2992240 0.1495106
2023-07-26 23:24:33 1 min 7.103 sec 296.0 0.3866689 0.2992322 0.1495128
2023-07-26 23:24:33 1 min 7.111 sec 297.0 0.3866656 0.2992397 0.1495103
2023-07-26 23:24:33 1 min 7.122 sec 298.0 0.3866640 0.2992462 0.1495091
2023-07-26 23:24:33 1 min 7.130 sec 299.0 0.3866674 0.2992554 0.1495117
2023-07-26 23:24:33 1 min 7.139 sec 300.0 0.3866661 0.2992475 0.1495107
[301 rows x 7 columns]
Variable Importances:
variable relative_importance scaled_importance percentage
TD009_bin_WOE 2103.5937500 1.0 0.2383542
TD005_WOE 1458.4207764 0.6932996 0.1652509
PA029_bin_WOE 1220.0726318 0.5799944 0.1382441
TD014_bin_WOE 594.2836304 0.2825087 0.0673371
CR019_WOE 551.7902832 0.2623084 0.0625223
CR015_bin_WOE 534.6726685 0.2541711 0.0605827
PA023_bin_WOE 413.6450500 0.1966373 0.0468693
AP003_bin_WOE 331.2231140 0.1574558 0.0375303
AP001_WOE 304.8792114 0.1449326 0.0345453
AP008_WOE 283.7080994 0.1348683 0.0321464
TD010_bin_WOE 267.6243286 0.1272224 0.0303240
TD001_bin_WOE 251.5751953 0.1195931 0.0285055
CR009_bin_WOE 246.8613586 0.1173522 0.0279714
PA022_bin_WOE 185.5974274 0.0882287 0.0210297
TD006_bin_WOE 77.5485992 0.0368648 0.0087869

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [ ]:
def VarImp(model_name):

    from sklearn.metrics import roc_curve,auc
    from sklearn.metrics import average_precision_score
    from sklearn.metrics import precision_recall_curve
    import matplotlib.pyplot as plt

    # plot the variable importance
    plt.rcdefaults()
    variables = model_name._model_json['output']['variable_importances']['variable']
    y_pos = np.arange(len(variables))
    fig, ax = plt.subplots(figsize = (6,len(variables)/2))
    scaled_importance = model_name._model_json['output']['variable_importances']['scaled_importance']
    ax.barh(y_pos,scaled_importance,align='center',color='green')
    ax.set_yticks(y_pos)
    ax.set_yticklabels(variables)
    ax.invert_yaxis()
    ax.set_xlabel('Scaled Importance')
    ax.set_title('Variable Importance')
    plt.show()

VarImp(rf_v1)
In [ ]:
predictions = rf_v1.predict(test_hex)
predictions.head()
test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()
test_scores.head()
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[ ]:
loan_default predict
0 0 0.278723
1 0 0.257221
2 0 0.209367
3 0 0.153741
4 0 0.215133
In [ ]:
def createGains(model):
    predictions = model.predict(test_hex)
    test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()

    #sort on prediction (descending), add id, and decile for groups containing 1/10 of datapoints
    test_scores = test_scores.sort_values(by='predict',ascending=False)
    test_scores['row_id'] = range(0,0+len(test_scores))
    test_scores['decile'] = ( test_scores['row_id'] / (len(test_scores)/10) ).astype(int)
    #see count by decile
    test_scores.loc[test_scores['decile'] == 10]=9
    test_scores['decile'].value_counts()

    #create gains table
    gains = test_scores.groupby('decile')['loan_default'].agg(['count','sum'])
    gains.columns = ['count','actual']
    gains

    #add features to gains table
    gains['non_actual'] = gains['count'] - gains['actual']
    gains['cum_count'] = gains['count'].cumsum()
    gains['cum_actual'] = gains['actual'].cumsum()
    gains['cum_non_actual'] = gains['non_actual'].cumsum()
    gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
    gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
    gains['if_random'] = np.max(gains['cum_actual']) /10
    gains['if_random'] = gains['if_random'].cumsum()
    gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
    gains['K_S'] = np.abs( gains['percent_cum_actual'] -  gains['percent_cum_non_actual'] ) * 100
    gains['gain']=(gains['cum_actual']/gains['cum_count']*100).round(2)
    gains = pd.DataFrame(gains)
    return(gains)

createGains(rf_v1)
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[ ]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 160 47 113 160 47 113 0.16 0.09 30.0 1.57 7.0 29.38
1 160 42 118 320 89 231 0.30 0.18 60.0 1.48 12.0 27.81
2 160 41 119 480 130 350 0.43 0.27 90.0 1.44 16.0 27.08
3 160 35 125 640 165 475 0.55 0.37 120.0 1.38 18.0 25.78
4 160 31 129 800 196 604 0.65 0.46 150.0 1.31 19.0 24.50
5 160 24 136 960 220 740 0.73 0.57 180.0 1.22 16.0 22.92
6 160 23 137 1120 243 877 0.81 0.67 210.0 1.16 14.0 21.70
7 160 16 144 1280 259 1021 0.86 0.79 240.0 1.08 7.0 20.23
8 160 23 137 1440 282 1158 0.94 0.89 270.0 1.04 5.0 19.58
9 160 18 142 1600 300 1300 1.00 1.00 300.0 1.00 0.0 18.75
In [ ]:
def ROC_AUC(my_result,df,target):
    from sklearn.metrics import roc_curve,auc
    from sklearn.metrics import average_precision_score
    from sklearn.metrics import precision_recall_curve
    import matplotlib.pyplot as plt

    # ROC
    y_actual = df[target].as_data_frame()
    y_pred = my_result.predict(df).as_data_frame()
    fpr = list()
    tpr = list()
    roc_auc = list()
    fpr,tpr,_ = roc_curve(y_actual,y_pred)
    roc_auc = auc(fpr,tpr)

    # Precision-Recall
    average_precision = average_precision_score(y_actual,y_pred)

    print('')
    print('   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate')
    print('')
    print('	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy')
    print('')
    print('   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)')
    print('')

    # plotting
    plt.figure(figsize=(10,4))

    # ROC
    plt.subplot(1,2,1)
    plt.plot(fpr,tpr,color='darkorange',lw=2,label='ROC curve (aare=%0.2f)' % roc_auc)
    plt.plot([0,1],[0,1],color='navy',lw=3,linestyle='--')
    plt.xlim([0.0,1.0])
    plt.ylim([0.0,1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic: AUC={0:0.4f}'.format(roc_auc))
    plt.legend(loc='lower right')


    # Precision-Recall
    plt.subplot(1,2,2)
    precision,recall,_ = precision_recall_curve(y_actual,y_pred)
    plt.step(recall,precision,color='b',alpha=0.2,where='post')
    plt.fill_between(recall,precision,step='post',alpha=0.2,color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0,1.05])
    plt.xlim([0.0,1.0])
    plt.title('Precision-Recall curve: PR={0:0.4f}'.format(average_precision))
    plt.show()
In [ ]:
ROC_AUC(rf_v1,test_hex,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

Now the code works fine with the small dataset, we can model with the entire dataset.¶

Turn out it doesn't perform better with entire training dataset. One possible reason is smaller datasets are less likely to lead to overfitting as they force the model to generalize better.

In [ ]:
train_hex = h2o.H2OFrame(train_df_rf)
test_hex = h2o.H2OFrame(test_df_rf)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
In [ ]:
rf_v2 = H2ORandomForestEstimator(
        model_id = 'rf_v2',
        ntrees = 300,
        nfolds=10,
        min_rows=100,
        seed=1234)
rf_v2.train(predictors,target,training_frame=train_hex)
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Out[ ]:
Model Details
=============
H2ORandomForestEstimator : Distributed Random Forest
Model Key: rf_v2
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
300.0 300.0 1073835.0 13.0 20.0 17.026667 263.0 297.0 280.53665
ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.1498718771489339
RMSE: 0.38713289339570967
MAE: 0.30003284286595944
RMSLE: 0.2715420816428299
Mean Residual Deviance: 0.1498718771489339
ModelMetricsRegression: drf
** Reported on cross-validation data. **

MSE: 0.1499180749524892
RMSE: 0.3871925553939399
MAE: 0.30018390703180164
RMSLE: 0.271583976669591
Mean Residual Deviance: 0.1499180749524892
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid cv_8_valid cv_9_valid cv_10_valid
mae 0.3001959 0.0035226 0.3043954 0.3046928 0.3026648 0.297634 0.2955310 0.2948277 0.3015419 0.2978897 0.3019386 0.3008434
mean_residual_deviance 0.1499312 0.0035814 0.1541525 0.1549230 0.1515166 0.1477617 0.1460344 0.1438500 0.1517742 0.1474197 0.1518276 0.1500522
mse 0.1499312 0.0035814 0.1541525 0.1549230 0.1515166 0.1477617 0.1460344 0.1438500 0.1517742 0.1474197 0.1518276 0.1500522
r2 0.0363812 0.0029430 0.0362065 0.0338884 0.0395042 0.0345291 0.0313030 0.0379271 0.0411318 0.0376920 0.0341692 0.0374612
residual_deviance 0.1499312 0.0035814 0.1541525 0.1549230 0.1515166 0.1477617 0.1460344 0.1438500 0.1517742 0.1474197 0.1518276 0.1500522
rmse 0.3871846 0.0046316 0.3926226 0.3936026 0.3892513 0.3843979 0.3821444 0.3792757 0.3895821 0.3839527 0.3896506 0.3873657
rmsle 0.2715834 0.0024911 0.2745020 0.2751261 0.2727129 0.2700233 0.2689685 0.2673593 0.2726018 0.2697902 0.2730406 0.2717092
Scoring History:
timestamp duration number_of_trees training_rmse training_mae training_deviance
2023-07-26 23:30:18 5 min 39.854 sec 0.0 nan nan nan
2023-07-26 23:30:18 5 min 39.953 sec 1.0 0.3899816 0.2991295 0.1520856
2023-07-26 23:30:18 5 min 40.058 sec 2.0 0.3913598 0.3006538 0.1531625
2023-07-26 23:30:18 5 min 40.148 sec 3.0 0.3896600 0.3000131 0.1518349
2023-07-26 23:30:18 5 min 40.247 sec 4.0 0.3890223 0.2999989 0.1513384
2023-07-26 23:30:18 5 min 40.352 sec 5.0 0.3891703 0.3000873 0.1514535
2023-07-26 23:30:18 5 min 40.453 sec 6.0 0.3886908 0.3002141 0.1510805
2023-07-26 23:30:18 5 min 40.550 sec 7.0 0.3884296 0.3002264 0.1508775
2023-07-26 23:30:18 5 min 40.649 sec 8.0 0.3882854 0.3001559 0.1507655
2023-07-26 23:30:19 5 min 40.748 sec 9.0 0.3880043 0.3000113 0.1505474
--- --- --- --- --- --- ---
2023-07-26 23:30:21 5 min 43.472 sec 35.0 0.3872759 0.3002579 0.1499827
2023-07-26 23:30:21 5 min 43.643 sec 36.0 0.3872822 0.3002692 0.1499875
2023-07-26 23:30:22 5 min 43.811 sec 37.0 0.3872803 0.3002645 0.1499861
2023-07-26 23:30:26 5 min 47.869 sec 76.0 0.3871462 0.3001187 0.1498822
2023-07-26 23:30:30 5 min 51.936 sec 120.0 0.3871379 0.3000850 0.1498757
2023-07-26 23:30:34 5 min 56.004 sec 160.0 0.3871235 0.3000569 0.1498646
2023-07-26 23:30:38 6 min 0.010 sec 200.0 0.3871409 0.3000359 0.1498781
2023-07-26 23:30:42 6 min 4.102 sec 244.0 0.3871412 0.3000565 0.1498783
2023-07-26 23:30:46 6 min 8.123 sec 282.0 0.3871323 0.3000268 0.1498714
2023-07-26 23:30:48 6 min 9.983 sec 300.0 0.3871329 0.3000328 0.1498719
[45 rows x 7 columns]
Variable Importances:
variable relative_importance scaled_importance percentage
TD009_bin_WOE 21400.7890625 1.0 0.1963187
TD005_WOE 18484.5566406 0.8637325 0.1695668
TD014_bin_WOE 10382.4628906 0.4851439 0.0952428
AP003_bin_WOE 9848.0214844 0.4601710 0.0903402
CR015_bin_WOE 8241.4550781 0.3851005 0.0756024
AP008_WOE 5805.2836914 0.2712649 0.0532544
CR019_WOE 5420.5742188 0.2532885 0.0497253
PA029_bin_WOE 5250.4301758 0.2453382 0.0481645
TD010_bin_WOE 4775.7114258 0.2231559 0.0438097
PA022_bin_WOE 4665.4946289 0.2180057 0.0427986
AP001_WOE 4154.1879883 0.1941138 0.0381082
TD001_bin_WOE 3647.7722168 0.1704504 0.0334626
PA023_bin_WOE 3111.0019531 0.1453686 0.0285386
CR009_bin_WOE 2362.7492676 0.1104048 0.0216745
TD006_bin_WOE 1459.9486084 0.0682194 0.0133927

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [ ]:
ROC_AUC(rf_v2,test_hex,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

In [ ]:
createGains(rf_v2)
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[ ]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 1600 509 1091 1600 509 1091 0.16 0.08 315.0 1.62 8.0 31.81
1 1600 440 1160 3200 949 2251 0.30 0.18 630.0 1.51 12.0 29.66
2 1600 362 1238 4800 1311 3489 0.42 0.27 945.0 1.39 15.0 27.31
3 1600 368 1232 6400 1679 4721 0.53 0.37 1260.0 1.33 16.0 26.23
4 1600 326 1274 8000 2005 5995 0.64 0.47 1575.0 1.27 17.0 25.06
5 1600 237 1363 9600 2242 7358 0.71 0.57 1890.0 1.19 14.0 23.35
6 1600 239 1361 11200 2481 8719 0.79 0.68 2205.0 1.13 11.0 22.15
7 1600 263 1337 12800 2744 10056 0.87 0.78 2520.0 1.09 9.0 21.44
8 1600 233 1367 14400 2977 11423 0.95 0.89 2835.0 1.05 6.0 20.67
9 1600 173 1427 16000 3150 12850 1.00 1.00 3150.0 1.00 0.0 19.69

Use H2O's "balance_classes"¶

  • The balance_classes option can be used to balance the class distribution. When enabled, H2O will either undersample the majority classes or oversample the minority classes.
  • Note that the resulting model will also correct the final probabilities (“undo the sampling”) using a monotonic transform, so the predicted probabilities of the first model will differ from a second model. However, because AUC only cares about ordering, it won’t be affected.
  • See this H2O page.
In [ ]:
train_smpl = train_df_rf.sample(frac=0.1, random_state=1)
test_smpl = test_df_rf.sample(frac=0.1, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
In [ ]:
rf_v3 = H2ORandomForestEstimator(
        model_id = 'rf_v3',
        ntrees = 300,
        nfolds=10,
        min_rows=100,
        balance_classes = True,
        seed=1234)
rf_v3.train(predictors,target,training_frame=train_hex)
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Out[ ]:
Model Details
=============
H2ORandomForestEstimator : Distributed Random Forest
Model Key: rf_v3
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
300.0 300.0 125323.0 7.0 12.0 8.963333 24.0 32.0 28.423334
ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.14951065232172528
RMSE: 0.3866660734040747
MAE: 0.2992474803933356
RMSLE: 0.2712098866294993
Mean Residual Deviance: 0.14951065232172528
ModelMetricsRegression: drf
** Reported on cross-validation data. **

MSE: 0.14963480040402838
RMSE: 0.3868265766516416
MAE: 0.299352536320045
RMSLE: 0.27131915462305656
Mean Residual Deviance: 0.14963480040402838
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid cv_8_valid cv_9_valid cv_10_valid
mae 0.2993906 0.0082924 0.3132309 0.2944414 0.3019102 0.2925448 0.3000576 0.3052748 0.2901710 0.3090672 0.2877329 0.2994755
mean_residual_deviance 0.1496330 0.0094428 0.1652384 0.1433392 0.1545313 0.1437271 0.1504920 0.1577827 0.1377648 0.1591601 0.1371658 0.1471285
mse 0.1496330 0.0094428 0.1652384 0.1433392 0.1545313 0.1437271 0.1504920 0.1577827 0.1377648 0.1591601 0.1371658 0.1471285
r2 0.0231276 0.0151769 0.0032182 0.0241438 0.0142291 0.0122233 0.0187475 0.0222429 0.0333000 0.0565313 0.0124066 0.0342336
residual_deviance 0.1496330 0.0094428 0.1652384 0.1433392 0.1545313 0.1437271 0.1504920 0.1577827 0.1377648 0.1591601 0.1371658 0.1471285
rmse 0.3866515 0.0121850 0.4064952 0.3786016 0.3931047 0.3791136 0.3879329 0.3972187 0.3711669 0.3989487 0.3703589 0.3835733
rmsle 0.2712443 0.0065110 0.2830140 0.2670296 0.2748869 0.2674065 0.2722500 0.2767344 0.2627884 0.2759479 0.2629195 0.2694657
Scoring History:
timestamp duration number_of_trees training_rmse training_mae training_deviance
2023-07-26 23:31:26 33.817 sec 0.0 nan nan nan
2023-07-26 23:31:26 33.827 sec 1.0 0.3875406 0.2971530 0.1501877
2023-07-26 23:31:26 33.836 sec 2.0 0.3890889 0.3008566 0.1513902
2023-07-26 23:31:26 33.844 sec 3.0 0.3882861 0.2995615 0.1507661
2023-07-26 23:31:26 33.853 sec 4.0 0.3869131 0.2988633 0.1497017
2023-07-26 23:31:26 33.863 sec 5.0 0.3873920 0.3000798 0.1500726
2023-07-26 23:31:26 33.872 sec 6.0 0.3873828 0.2999194 0.1500654
2023-07-26 23:31:26 33.880 sec 7.0 0.3879887 0.3004147 0.1505352
2023-07-26 23:31:26 33.889 sec 8.0 0.3881241 0.3000204 0.1506403
2023-07-26 23:31:26 33.896 sec 9.0 0.3876442 0.3000743 0.1502680
--- --- --- --- --- --- ---
2023-07-26 23:31:29 36.777 sec 291.0 0.3866741 0.2992291 0.1495169
2023-07-26 23:31:29 36.786 sec 292.0 0.3866675 0.2992232 0.1495117
2023-07-26 23:31:29 36.795 sec 293.0 0.3866678 0.2992219 0.1495120
2023-07-26 23:31:29 36.804 sec 294.0 0.3866633 0.2992180 0.1495085
2023-07-26 23:31:29 36.819 sec 295.0 0.3866660 0.2992240 0.1495106
2023-07-26 23:31:29 36.828 sec 296.0 0.3866689 0.2992322 0.1495128
2023-07-26 23:31:29 36.837 sec 297.0 0.3866656 0.2992397 0.1495103
2023-07-26 23:31:29 36.848 sec 298.0 0.3866640 0.2992462 0.1495091
2023-07-26 23:31:29 36.856 sec 299.0 0.3866674 0.2992554 0.1495117
2023-07-26 23:31:29 36.865 sec 300.0 0.3866661 0.2992475 0.1495107
[301 rows x 7 columns]
Variable Importances:
variable relative_importance scaled_importance percentage
TD009_bin_WOE 2103.5937500 1.0 0.2383542
TD005_WOE 1458.4207764 0.6932996 0.1652509
PA029_bin_WOE 1220.0726318 0.5799944 0.1382441
TD014_bin_WOE 594.2836304 0.2825087 0.0673371
CR019_WOE 551.7902832 0.2623084 0.0625223
CR015_bin_WOE 534.6726685 0.2541711 0.0605827
PA023_bin_WOE 413.6450500 0.1966373 0.0468693
AP003_bin_WOE 331.2231140 0.1574558 0.0375303
AP001_WOE 304.8792114 0.1449326 0.0345453
AP008_WOE 283.7080994 0.1348683 0.0321464
TD010_bin_WOE 267.6243286 0.1272224 0.0303240
TD001_bin_WOE 251.5751953 0.1195931 0.0285055
CR009_bin_WOE 246.8613586 0.1173522 0.0279714
PA022_bin_WOE 185.5974274 0.0882287 0.0210297
TD006_bin_WOE 77.5485992 0.0368648 0.0087869

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [ ]:
ROC_AUC(rf_v3,test_hex,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

In [ ]:
createGains(rf_v3)
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[ ]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 160 47 113 160 47 113 0.16 0.09 30.0 1.57 7.0 29.38
1 160 42 118 320 89 231 0.30 0.18 60.0 1.48 12.0 27.81
2 160 41 119 480 130 350 0.43 0.27 90.0 1.44 16.0 27.08
3 160 35 125 640 165 475 0.55 0.37 120.0 1.38 18.0 25.78
4 160 31 129 800 196 604 0.65 0.46 150.0 1.31 19.0 24.50
5 160 24 136 960 220 740 0.73 0.57 180.0 1.22 16.0 22.92
6 160 23 137 1120 243 877 0.81 0.67 210.0 1.16 14.0 21.70
7 160 16 144 1280 259 1021 0.86 0.79 240.0 1.08 7.0 20.23
8 160 23 137 1440 282 1158 0.94 0.89 270.0 1.04 5.0 19.58
9 160 18 142 1600 300 1300 1.00 1.00 300.0 1.00 0.0 18.75

Undersampling¶

  • Undersampling is a technique used to tackle class imbalance in a dataset. It involves reducing the representation of the majority class by randomly removing instances from the majority class until the desired balance between classes is achieved. By reducing the number of instances of the majority class, undersampling helps prevent the model from being overwhelmed by the dominant class and focuses on learning from the minority class instances.
  • However, undersampling might lead to a loss of potentially useful information from the majority class, which could impact the model's overall performance.
  • Using Under-Sampling Techniques for Extremely Imbalanced Data
  • imblearn
In [ ]:
#Concatenate along rows (vertically)
#data_undersample = pd.concat([train_df_rf, test_df_rf])
#data_undersample = data_undersample.sort_values(by='id', ascending=True)
#data_undersample
In [ ]:
y = train_df_rf[target]
X = train_df_rf.drop(target,axis=1)
y.dtypes
Out[ ]:
dtype('int64')
In [ ]:
y1_cnt = train_df_rf[target].sum()
y1_cnt
Out[ ]:
12338
In [ ]:
N = 2
y0_cnt = y1_cnt * N
y0_cnt
Out[ ]:
24676
In [ ]:
pip install imblearn
Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (from imblearn) (0.10.1)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.22.4)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.10.1)
Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.2.2)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.3.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (3.2.0)
Installing collected packages: imblearn
Successfully installed imblearn-0.0
In [ ]:
from imblearn.datasets import make_imbalance
X_rs, y_rs = make_imbalance(X, y,
                            sampling_strategy={1:y1_cnt , 0:  y0_cnt},
                            random_state=0)
X_rs = pd.DataFrame(X_rs)
y_rs = pd.DataFrame(y_rs)
In [ ]:
y_rs = train_df_rf[train_df_rf[target]==1]
X_rs = train_df_rf[train_df_rf[target]==0].sample(n=y0_cnt)
smpl = pd.concat([X_rs,y_rs])
smpl_hex = h2o.H2OFrame(smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
In [ ]:
rf_v4 = H2ORandomForestEstimator(
        model_id = 'rf_v4',
        ntrees = 300,
        nfolds=10,
        min_rows=100,
        seed=1234)
rf_v4.train(predictors,target,training_frame=smpl_hex)
#train with the upsampled smpl_hex as the training frame
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Out[ ]:
Model Details
=============
H2ORandomForestEstimator : Distributed Random Forest
Model Key: rf_v4
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
300.0 300.0 631596.0 12.0 18.0 14.2 149.0 174.0 163.03
ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.21077186366658382
RMSE: 0.4590989693590956
MAE: 0.4228487633192769
RMSLE: 0.3227436376803513
Mean Residual Deviance: 0.21077186366658382
ModelMetricsRegression: drf
** Reported on cross-validation data. **

MSE: 0.21079706546892607
RMSE: 0.4591264155643041
MAE: 0.4230214030525104
RMSLE: 0.32274926426018896
Mean Residual Deviance: 0.21079706546892607
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid cv_8_valid cv_9_valid cv_10_valid
mae 0.4230262 0.0016552 0.4242583 0.4226995 0.4232406 0.4205764 0.4217563 0.4217314 0.4248930 0.4236615 0.4258246 0.4216206
mean_residual_deviance 0.2108000 0.0015687 0.2116346 0.2108016 0.2111888 0.2086496 0.2087994 0.2098977 0.2120005 0.2121581 0.2134057 0.2094645
mse 0.2108000 0.0015687 0.2116346 0.2108016 0.2111888 0.2086496 0.2087994 0.2098977 0.2120005 0.2121581 0.2134057 0.2094645
r2 0.0512652 0.0054807 0.0512642 0.0533489 0.0554153 0.0419659 0.0552363 0.0527978 0.0554222 0.0512500 0.0407212 0.0552302
residual_deviance 0.2108000 0.0015687 0.2116346 0.2108016 0.2111888 0.2086496 0.2087994 0.2098977 0.2120005 0.2121581 0.2134057 0.2094645
rmse 0.4591268 0.0017082 0.4600376 0.4591313 0.4595528 0.4567817 0.4569457 0.4581459 0.4604351 0.4606062 0.4619585 0.4576729
rmsle 0.3227511 0.0008634 0.3232280 0.32237 0.3225686 0.3228075 0.3220436 0.3219627 0.3228393 0.3230383 0.3248140 0.3218389
Scoring History:
timestamp duration number_of_trees training_rmse training_mae training_deviance
2023-07-26 23:34:15 2 min 38.652 sec 0.0 nan nan nan
2023-07-26 23:34:15 2 min 38.703 sec 1.0 0.4645581 0.4238703 0.2158143
2023-07-26 23:34:15 2 min 38.747 sec 2.0 0.4622949 0.4223424 0.2137165
2023-07-26 23:34:15 2 min 38.797 sec 3.0 0.4623465 0.4230279 0.2137643
2023-07-26 23:34:15 2 min 38.844 sec 4.0 0.4627498 0.4236462 0.2141374
2023-07-26 23:34:15 2 min 38.891 sec 5.0 0.4622605 0.4230023 0.2136848
2023-07-26 23:34:15 2 min 38.938 sec 6.0 0.4615715 0.4228065 0.2130483
2023-07-26 23:34:15 2 min 38.981 sec 7.0 0.4611671 0.4228217 0.2126751
2023-07-26 23:34:15 2 min 39.023 sec 8.0 0.4608365 0.4227504 0.2123702
2023-07-26 23:34:15 2 min 39.068 sec 9.0 0.4606891 0.4227239 0.2122345
--- --- --- --- --- --- ---
2023-07-26 23:34:19 2 min 42.335 sec 81.0 0.4592008 0.4228961 0.2108654
2023-07-26 23:34:19 2 min 42.382 sec 82.0 0.4591884 0.4228797 0.2108540
2023-07-26 23:34:19 2 min 42.425 sec 83.0 0.4591783 0.4228581 0.2108448
2023-07-26 23:34:19 2 min 42.467 sec 84.0 0.4591619 0.4228475 0.2108296
2023-07-26 23:34:19 2 min 42.520 sec 85.0 0.4591572 0.4228333 0.2108253
2023-07-26 23:34:19 2 min 42.566 sec 86.0 0.4591530 0.4228493 0.2108215
2023-07-26 23:34:19 2 min 42.611 sec 87.0 0.4591583 0.4228466 0.2108264
2023-07-26 23:34:23 2 min 46.640 sec 162.0 0.4591054 0.4228300 0.2107778
2023-07-26 23:34:27 2 min 50.647 sec 255.0 0.4590985 0.4228171 0.2107714
2023-07-26 23:34:29 2 min 52.567 sec 300.0 0.4590990 0.4228488 0.2107719
[91 rows x 7 columns]
Variable Importances:
variable relative_importance scaled_importance percentage
TD009_bin_WOE 24682.8769531 1.0 0.2158746
TD005_WOE 19699.0332031 0.7980850 0.1722863
TD014_bin_WOE 11587.6376953 0.4694606 0.1013446
AP003_bin_WOE 10638.4736328 0.4310062 0.0930433
CR015_bin_WOE 9107.7890625 0.3689922 0.0796560
PA029_bin_WOE 5926.0659180 0.2400881 0.0518289
TD010_bin_WOE 5356.1132812 0.2169971 0.0468442
AP008_WOE 5094.9741211 0.2064174 0.0445603
PA022_bin_WOE 4758.9101562 0.1928021 0.0416211
CR019_WOE 4499.7324219 0.1823018 0.0393543
TD001_bin_WOE 3518.0322266 0.1425293 0.0307684
AP001_WOE 3328.6904297 0.1348583 0.0291125
PA023_bin_WOE 3035.7680664 0.1229909 0.0265506
CR009_bin_WOE 1846.0827637 0.0747920 0.0161457
TD006_bin_WOE 1258.7723389 0.0509978 0.0110091

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [ ]:
ROC_AUC(rf_v4,smpl_hex,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

In [ ]:
ROC_AUC(rf_v4,test_hex,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

In [ ]:
createGains(rf_v4)
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[ ]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 160 47 113 160 47 113 0.16 0.09 30.0 1.57 7.0 29.38
1 160 48 112 320 95 225 0.32 0.17 60.0 1.58 15.0 29.69
2 160 40 120 480 135 345 0.45 0.27 90.0 1.50 18.0 28.12
3 160 37 123 640 172 468 0.57 0.36 120.0 1.43 21.0 26.88
4 160 25 135 800 197 603 0.66 0.46 150.0 1.31 20.0 24.62
5 160 26 134 960 223 737 0.74 0.57 180.0 1.24 17.0 23.23
6 160 16 144 1120 239 881 0.80 0.68 210.0 1.14 12.0 21.34
7 160 19 141 1280 258 1022 0.86 0.79 240.0 1.08 7.0 20.16
8 160 26 134 1440 284 1156 0.95 0.89 270.0 1.05 6.0 19.72
9 160 16 144 1600 300 1300 1.00 1.00 300.0 1.00 0.0 18.75

Oversampling¶

  • Oversampling is a technique used in machine learning to address class imbalance in a dataset.
  • It aims to improve model performance by providing more training data for the underrepresented class, thus reducing bias and enabling the model to better capture the characteristics of the minority class.
In [ ]:
from imblearn.over_sampling import RandomOverSampler
In [ ]:
# Assuming you have a DataFrame train_df_rf with your training data
target = 'loan_default'
X = train_df_rf.drop(target, axis=1)
y = train_df_rf[target]
# Instantiate the RandomOverSampler
ros = RandomOverSampler(random_state=0)
# Perform the Random Over-Sampling on the data
X_ros, y_ros = ros.fit_resample(X, y)

X_ros = pd.DataFrame(X_ros)
y_ros = pd.DataFrame(y_ros)

y_ros = train_df_rf[train_df_rf[target]==1]
X_ros = train_df_rf[train_df_rf[target]==0].sample(n=y0_cnt)
smpl2 = pd.concat([X_ros,y_ros])
smpl_hex2 = h2o.H2OFrame(smpl2)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
In [ ]:
rf_v5 = H2ORandomForestEstimator(
        model_id = 'rf_v5',
        ntrees = 300,
        nfolds=10,
        min_rows=100,
        seed=1234)
rf_v5.train(predictors,target,training_frame=smpl_hex2)
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Out[ ]:
Model Details
=============
H2ORandomForestEstimator : Distributed Random Forest
Model Key: rf_v5
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
300.0 300.0 631419.0 12.0 18.0 14.646667 153.0 173.0 163.04
ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.2108294692109558
RMSE: 0.45916170268322226
MAE: 0.4230355253292622
RMSLE: 0.32276222801680743
Mean Residual Deviance: 0.2108294692109558
ModelMetricsRegression: drf
** Reported on cross-validation data. **

MSE: 0.21091192465924377
RMSE: 0.4592514830234561
MAE: 0.4232599539533539
RMSLE: 0.3228381427219545
Mean Residual Deviance: 0.21091192465924377
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid cv_8_valid cv_9_valid cv_10_valid
mae 0.4232679 0.0016838 0.4253459 0.4241945 0.4224509 0.4198744 0.4224276 0.4250855 0.4217607 0.4231897 0.4245106 0.4238390
mean_residual_deviance 0.2109185 0.0015538 0.2124475 0.2120962 0.2102032 0.2082453 0.2090172 0.2123263 0.2099796 0.2115506 0.2127480 0.2105706
mse 0.2109185 0.0015538 0.2124475 0.2120962 0.2102032 0.2082453 0.2090172 0.2123263 0.2099796 0.2115506 0.2127480 0.2105706
r2 0.0507201 0.0073772 0.0476200 0.0475348 0.0598237 0.0438222 0.0542511 0.0418383 0.0644265 0.0539665 0.0436772 0.0502408
residual_deviance 0.2109185 0.0015538 0.2124475 0.2120962 0.2102032 0.2082453 0.2090172 0.2123263 0.2099796 0.2115506 0.2127480 0.2105706
rmse 0.4592558 0.0016929 0.4609203 0.4605391 0.4584792 0.4563390 0.4571839 0.4607888 0.4582353 0.4599463 0.4612462 0.4588798
rmsle 0.3228421 0.0011548 0.3240515 0.3234761 0.3216558 0.3223401 0.3222780 0.3242907 0.3207237 0.3225305 0.3240419 0.3230334
Scoring History:
timestamp duration number_of_trees training_rmse training_mae training_deviance
2023-07-26 23:37:30 2 min 50.855 sec 0.0 nan nan nan
2023-07-26 23:37:30 2 min 50.982 sec 1.0 0.4634545 0.4230118 0.2147901
2023-07-26 23:37:30 2 min 51.102 sec 2.0 0.4629596 0.4226218 0.2143316
2023-07-26 23:37:30 2 min 51.180 sec 3.0 0.4622660 0.4228333 0.2136898
2023-07-26 23:37:31 2 min 51.239 sec 4.0 0.4629738 0.4237793 0.2143448
2023-07-26 23:37:31 2 min 51.311 sec 5.0 0.4626122 0.4233074 0.2140101
2023-07-26 23:37:31 2 min 51.379 sec 6.0 0.4620551 0.4232934 0.2134949
2023-07-26 23:37:31 2 min 51.446 sec 7.0 0.4615999 0.4233006 0.2130745
2023-07-26 23:37:31 2 min 51.498 sec 8.0 0.4613335 0.4233212 0.2128286
2023-07-26 23:37:31 2 min 51.552 sec 9.0 0.4610778 0.4232437 0.2125928
--- --- --- --- --- --- ---
2023-07-26 23:37:34 2 min 54.613 sec 58.0 0.4593532 0.4231081 0.2110054
2023-07-26 23:37:34 2 min 54.675 sec 59.0 0.4593352 0.4230896 0.2109888
2023-07-26 23:37:34 2 min 54.725 sec 60.0 0.4593225 0.4230810 0.2109772
2023-07-26 23:37:34 2 min 54.768 sec 61.0 0.4592950 0.4230651 0.2109519
2023-07-26 23:37:34 2 min 54.811 sec 62.0 0.4593009 0.4230671 0.2109573
2023-07-26 23:37:34 2 min 54.855 sec 63.0 0.4592915 0.4230444 0.2109487
2023-07-26 23:37:38 2 min 58.891 sec 155.0 0.4591898 0.4230735 0.2108553
2023-07-26 23:37:42 3 min 2.910 sec 246.0 0.4591784 0.4229916 0.2108448
2023-07-26 23:37:46 3 min 6.914 sec 299.0 0.4591607 0.4230368 0.2108286
2023-07-26 23:37:46 3 min 6.993 sec 300.0 0.4591617 0.4230355 0.2108295
[68 rows x 7 columns]
Variable Importances:
variable relative_importance scaled_importance percentage
TD009_bin_WOE 24170.1152344 1.0 0.2141307
TD005_WOE 19442.2890625 0.8043937 0.1722454
TD014_bin_WOE 11670.7021484 0.4828567 0.1033945
AP003_bin_WOE 10944.4785156 0.4528104 0.0969606
CR015_bin_WOE 8672.9453125 0.3588293 0.0768364
PA022_bin_WOE 5423.5327148 0.2243900 0.0480488
AP008_WOE 5042.3710938 0.2086201 0.0446720
TD010_bin_WOE 4650.3427734 0.1924005 0.0411989
TD001_bin_WOE 4553.0815430 0.1883765 0.0403372
CR019_WOE 4020.8261719 0.1663553 0.0356218
PA029_bin_WOE 3881.5087891 0.1605912 0.0343875
PA023_bin_WOE 3716.6398926 0.1537701 0.0329269
AP001_WOE 3558.7382812 0.1472371 0.0315280
CR009_bin_WOE 2015.2391357 0.0833773 0.0178536
TD006_bin_WOE 1112.7031250 0.0460363 0.0098578

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [ ]:
ROC_AUC(rf_v5,smpl_hex2,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

In [ ]:
ROC_AUC(rf_v5,test_hex,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

Insights:¶

  • Both the ROC-AUC metric and PR (average precision) indicates that the model performs the best after undersampling (rf_v4), followed by the model with oversampling (rf_v5), and lastly, the model that has addressed class imbalance in the dataset (rf_v3). Notably, they all perform better than decision tree model using the same set of feature.
  • It seems that under-sampling/over-sampling techniques do improve the predictability.
  • However, the AUCs of all the models are very close, suggesting similar discrimination abilities among the models.
  • Possible reasons:

(1) If the feature set doesn't contain strong discriminatory information for the minority class, balancing the data alone might not lead to substantial improvements. (2) If the original dataset is well-balanced, representative, and contains sufficient information for the classifier to learn, then both oversampling and undersampling might not have a substantial impact on the model's performance.

Next Steps:¶

  • Explore ways to improve the model's performance. This can involve feature selection, testing different sets of feature, and conducting hyperparameter tuning using techniques like grid search, random search, or Bayesian optimization to find the best combination of hyperparameters that yield the highest model performance.

Section 4 Decision Tree ¶

In [ ]:
from sklearn.tree import DecisionTreeClassifier # for classification
from sklearn.tree import DecisionTreeRegressor # for regression

# First, specify the model
dtree = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = 6)
In [ ]:
# Then, train the model.
dtree.fit(train_df_WOE_withoutid,train_df.target)
Out[ ]:
DecisionTreeClassifier(max_depth=6, min_samples_leaf=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=6, min_samples_leaf=5)
In [ ]:
features = ['AP001_WOE', 'AP003_bin_WOE', 'AP008_WOE', 'CR009_bin_WOE', 'CR015_bin_WOE', 'CR019_WOE', 'PA022_bin_WOE', 'PA023_bin_WOE', 'PA029_bin_WOE', 'TD001_bin_WOE', 'TD005_WOE', 'TD006_bin_WOE', 'TD009_bin_WOE', 'TD010_bin_WOE', 'TD014_bin_WOE']
In [ ]:
predictions = dtree.predict(test_df_WOE_withoutid[features])
predictions
Out[ ]:
array([0, 0, 0, ..., 0, 0, 0])
In [ ]:
dtree.predict_proba(test_df_WOE_withoutid[features])
Out[ ]:
array([[0.85357873, 0.14642127],
       [0.7785124 , 0.2214876 ],
       [0.95964126, 0.04035874],
       ...,
       [0.85357873, 0.14642127],
       [0.86291827, 0.13708173],
       [0.74327628, 0.25672372]])
In [ ]:
y_pred = dtree.predict_proba(test_df_WOE_withoutid[features])[:,1]
In [ ]:
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score, confusion_matrix

roc_auc_value = roc_auc_score(test_df.target,y_pred)
roc_auc_value
Out[ ]:
0.598508109443518
In [ ]:
fpr, tpr, _ = roc_curve(test_df.target,y_pred)
[fpr,tpr]
Out[ ]:
[array([0.00000000e+00, 5.44747082e-04, 1.08949416e-03, 1.16731518e-03,
        9.49416342e-03, 1.06614786e-02, 1.15953307e-02, 1.27626459e-02,
        2.87937743e-02, 3.19066148e-02, 4.38910506e-02, 4.69260700e-02,
        4.87937743e-02, 6.07003891e-02, 9.06614786e-02, 9.53307393e-02,
        1.57821012e-01, 1.94941634e-01, 2.06536965e-01, 2.27003891e-01,
        2.34007782e-01, 2.39922179e-01, 2.60778210e-01, 3.08404669e-01,
        3.28871595e-01, 3.41400778e-01, 3.46381323e-01, 3.83424125e-01,
        3.83424125e-01, 4.29260700e-01, 4.49105058e-01, 4.51517510e-01,
        4.55642023e-01, 4.64980545e-01, 4.66147860e-01, 4.81478599e-01,
        5.48560311e-01, 5.50350195e-01, 5.53307393e-01, 5.59299611e-01,
        5.98287938e-01, 6.05214008e-01, 6.63891051e-01, 6.78832685e-01,
        6.83968872e-01, 7.88949416e-01, 7.96342412e-01, 8.03657588e-01,
        8.29571984e-01, 8.39844358e-01, 8.50583658e-01, 8.57976654e-01,
        9.50894942e-01, 9.52217899e-01, 9.58521401e-01, 9.81556420e-01,
        9.85291829e-01, 9.91361868e-01, 9.94863813e-01, 9.96342412e-01,
        9.99610895e-01, 1.00000000e+00]),
 array([0.00000000e+00, 3.17460317e-04, 6.34920635e-04, 6.34920635e-04,
        1.39682540e-02, 1.49206349e-02, 1.52380952e-02, 1.61904762e-02,
        5.17460317e-02, 5.61904762e-02, 6.98412698e-02, 7.36507937e-02,
        7.74603175e-02, 1.10158730e-01, 1.68571429e-01, 1.73015873e-01,
        2.41904762e-01, 2.93015873e-01, 3.06666667e-01, 3.37460317e-01,
        3.49206349e-01, 3.56190476e-01, 3.98730159e-01, 4.64761905e-01,
        4.94920635e-01, 5.06349206e-01, 5.12063492e-01, 5.52380952e-01,
        5.52698413e-01, 5.82222222e-01, 6.00317460e-01, 6.01269841e-01,
        6.03809524e-01, 6.08253968e-01, 6.11428571e-01, 6.28888889e-01,
        6.65079365e-01, 6.66349206e-01, 6.66984127e-01, 6.71111111e-01,
        7.19047619e-01, 7.25714286e-01, 7.83809524e-01, 7.95238095e-01,
        7.97142857e-01, 8.73650794e-01, 8.81269841e-01, 8.85396825e-01,
        8.99047619e-01, 9.01269841e-01, 9.08888889e-01, 9.12063492e-01,
        9.78095238e-01, 9.80000000e-01, 9.82222222e-01, 9.89841270e-01,
        9.93015873e-01, 9.96825397e-01, 9.97777778e-01, 9.99047619e-01,
        9.99682540e-01, 1.00000000e+00])]
In [ ]:
import matplotlib.pyplot as plt
lw=2
plt.figure(figsize=(6,4))
plt.plot(fpr,tpr, color='darkorange',lw=lw,label='ROC curve (area = %0.2f)' %roc_auc_value)
plt.plot([0,1],[0,1], color='navy',lw=lw,linestyle='--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.legend(loc='lower right')
plt.show()

Interpretation of ROC and AUC¶

  • ROC shows the diagnostic ability of binary classifiers, the closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
  • In general, an AUC of 0.5 suggests no discrimination, an Area Under the Curve (AUC) of 0.6 indicates a moderate level of predictive power. The model's ability to distinguish between positive and negative instances is slightly better than random chance, but there is room for improvement.

Next Steps:¶

  • Explore ways to improve the model's performance. This can involve feature engineering, considering additional relevant variables, adjusting model parameters, or using more advanced modeling techniques. Iteratively refining the model and evaluating its impact on the ROC curve and AUC can help improve its performance.

Section 5 GBM ¶

What's Gradient Boosting Machine (GBM)?¶

Gradient Boosting Machine is a forward learning ensemble method. It is a powerful and popular machine learning algorithm used for both regression and classification tasks. It works by combining multiple weak learners, typically decision trees, in an iterative manner. Each subsequent tree corrects the errors made by the previous ones, gradually improving the model's predictive accuracy. GBM optimizes a loss function using gradient descent to find the best possible ensemble of trees.

H2O's GBM is an implementation of the GBM algorithm designed for high-performance and scalability. It offers various tuning parameters and options for model customization, making it a preferred choice for many data scientists and engineers dealing with big data scenarios.

In [ ]:
train_df_gbm = train_df_rf
test_df_gbm = test_df_rf
In [ ]:
#Use all the features first for testing
target = 'loan_default'
predictors = train_df_gbm.columns.tolist()
predictors=predictors[2:17]
predictors
Out[ ]:
['AP001_WOE',
 'AP003_bin_WOE',
 'AP008_WOE',
 'CR009_bin_WOE',
 'CR015_bin_WOE',
 'CR019_WOE',
 'PA022_bin_WOE',
 'PA023_bin_WOE',
 'PA029_bin_WOE',
 'TD001_bin_WOE',
 'TD005_WOE',
 'TD006_bin_WOE',
 'TD009_bin_WOE',
 'TD010_bin_WOE',
 'TD014_bin_WOE']
In [ ]:
#Use 50% training data
train_smpl = train_df_gbm.sample(frac=0.5, random_state=1)
test_smpl = test_df_gbm.sample(frac=0.5, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
In [ ]:
gbm_v1 = H2OGradientBoostingEstimator(
        model_id = 'gbm_v1',
        seed=1234)
gbm_v1.train(predictors,target,training_frame=train_hex)
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Out[ ]:
Model Details
=============
H2OGradientBoostingEstimator : Gradient Boosting Machine
Model Key: gbm_v1
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
50.0 50.0 22251.0 5.0 5.0 5.0 24.0 32.0 30.72
ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.144712223309159
RMSE: 0.3804105983134001
MAE: 0.29368577951529734
RMSLE: 0.2663937423135674
Mean Residual Deviance: 0.144712223309159
Scoring History:
timestamp duration number_of_trees training_rmse training_mae training_deviance
2023-07-26 01:00:03 0.350 sec 0.0 0.3950172 0.3120772 0.1560386
2023-07-26 01:00:04 1.243 sec 1.0 0.3934904 0.3108099 0.1548347
2023-07-26 01:00:05 1.560 sec 2.0 0.3922290 0.3096628 0.1538436
2023-07-26 01:00:05 1.754 sec 3.0 0.3911679 0.3086323 0.1530123
2023-07-26 01:00:05 1.940 sec 4.0 0.3902702 0.3076790 0.1523109
2023-07-26 01:00:05 2.092 sec 5.0 0.3894680 0.3067850 0.1516854
2023-07-26 01:00:05 2.256 sec 6.0 0.3888119 0.3059794 0.1511747
2023-07-26 01:00:05 2.405 sec 7.0 0.3882096 0.3052193 0.1507067
2023-07-26 01:00:06 2.573 sec 8.0 0.3876805 0.3045070 0.1502961
2023-07-26 01:00:06 2.711 sec 9.0 0.3872399 0.3038831 0.1499547
--- --- --- --- --- --- ---
2023-07-26 01:00:07 3.526 sec 15.0 0.3853087 0.3008516 0.1484628
2023-07-26 01:00:07 3.653 sec 16.0 0.3850870 0.3004736 0.1482920
2023-07-26 01:00:07 3.760 sec 17.0 0.3848541 0.3000817 0.1481127
2023-07-26 01:00:07 3.864 sec 18.0 0.3846232 0.2996997 0.1479350
2023-07-26 01:00:07 3.994 sec 19.0 0.3844344 0.2993804 0.1477898
2023-07-26 01:00:07 4.101 sec 20.0 0.3842524 0.2990728 0.1476499
2023-07-26 01:00:07 4.209 sec 21.0 0.3840773 0.2988061 0.1475154
2023-07-26 01:00:07 4.287 sec 22.0 0.3839034 0.2985029 0.1473818
2023-07-26 01:00:07 4.375 sec 23.0 0.3837294 0.2982259 0.1472483
2023-07-26 01:00:10 7.187 sec 50.0 0.3804106 0.2936858 0.1447122
[25 rows x 7 columns]
Variable Importances:
variable relative_importance scaled_importance percentage
TD009_bin_WOE 375.0860901 1.0 0.1963713
TD005_WOE 277.0020447 0.7385026 0.1450207
AP003_bin_WOE 199.6016083 0.5321488 0.1044987
CR015_bin_WOE 165.0644226 0.4400708 0.0864173
AP001_WOE 145.4165802 0.3876885 0.0761309
CR019_WOE 144.0874481 0.3841450 0.0754350
TD014_bin_WOE 116.8752594 0.3115958 0.0611885
AP008_WOE 104.1815643 0.2777537 0.0545429
PA023_bin_WOE 92.1280365 0.2456184 0.0482324
PA029_bin_WOE 73.7319412 0.1965734 0.0386014
TD001_bin_WOE 66.4085541 0.1770488 0.0347673
CR009_bin_WOE 52.7706642 0.1406895 0.0276274
PA022_bin_WOE 46.0381508 0.1227402 0.0241027
TD010_bin_WOE 36.5775566 0.0975178 0.0191497
TD006_bin_WOE 15.1164932 0.0403014 0.0079140

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [ ]:
VarImp(gbm_v1)
In [ ]:
createGains(gbm_v1)
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[ ]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 1600 468 1132 1600 468 1132 0.15 0.09 315.0 1.49 6.0 29.25
1 1600 427 1173 3200 895 2305 0.28 0.18 630.0 1.42 10.0 27.97
2 1600 383 1217 4800 1278 3522 0.41 0.27 945.0 1.35 14.0 26.62
3 1600 361 1239 6400 1639 4761 0.52 0.37 1260.0 1.30 15.0 25.61
4 1600 315 1285 8000 1954 6046 0.62 0.47 1575.0 1.24 15.0 24.42
5 1600 259 1341 9600 2213 7387 0.70 0.57 1890.0 1.17 13.0 23.05
6 1600 241 1359 11200 2454 8746 0.78 0.68 2205.0 1.11 10.0 21.91
7 1600 257 1343 12800 2711 10089 0.86 0.79 2520.0 1.08 7.0 21.18
8 1600 251 1349 14400 2962 11438 0.94 0.89 2835.0 1.04 5.0 20.57
9 1600 188 1412 16000 3150 12850 1.00 1.00 3150.0 1.00 0.0 19.69
In [ ]:
ROC_AUC(gbm_v1,test_hex,'loan_default')
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

Try Dropping 2 Less Important Features¶

Truns out using all transformed features achieves best performance¶

In [ ]:
#Use all the features first for testing
target = 'loan_default'
predictors = train_df_gbm.columns.tolist()
predictors=predictors[2:17]
values_to_remove = ['TD006_bin_WOE', 'TD010_bin_WOE']
predictors = [item for item in predictors if item not in values_to_remove]
predictors
Out[ ]:
['AP001_WOE',
 'AP003_bin_WOE',
 'AP008_WOE',
 'CR009_bin_WOE',
 'CR015_bin_WOE',
 'CR019_WOE',
 'PA022_bin_WOE',
 'PA023_bin_WOE',
 'PA029_bin_WOE',
 'TD001_bin_WOE',
 'TD005_WOE',
 'TD009_bin_WOE',
 'TD014_bin_WOE']
In [ ]:
gbm_v2 = H2OGradientBoostingEstimator(
        model_id = 'gbm_v1',
        seed=1234)
gbm_v2.train(predictors,target,training_frame=train_hex)
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Out[ ]:
Model Details
=============
H2OGradientBoostingEstimator : Gradient Boosting Machine
Model Key: gbm_v1
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
50.0 50.0 22361.0 5.0 5.0 5.0 26.0 32.0 30.9
ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.14693649700108505
RMSE: 0.3833229669626972
MAE: 0.29628467947138415
RMSLE: 0.2686390871878489
Mean Residual Deviance: 0.14693649700108505
Scoring History:
timestamp duration number_of_trees training_rmse training_mae training_deviance
2023-07-26 01:29:24 0.019 sec 0.0 0.3944827 0.3112333 0.1556166
2023-07-26 01:29:25 0.160 sec 1.0 0.3931736 0.3101478 0.1545855
2023-07-26 01:29:25 0.268 sec 2.0 0.3921000 0.3091753 0.1537424
2023-07-26 01:29:25 0.388 sec 3.0 0.3912023 0.3082834 0.1530392
2023-07-26 01:29:25 0.528 sec 4.0 0.3904423 0.3074741 0.1524452
2023-07-26 01:29:25 0.711 sec 5.0 0.3897861 0.3067226 0.1519332
2023-07-26 01:29:25 0.832 sec 6.0 0.3892457 0.3060523 0.1515122
2023-07-26 01:29:25 0.945 sec 7.0 0.3887421 0.3054030 0.1511204
2023-07-26 01:29:26 1.068 sec 8.0 0.3883223 0.3048271 0.1507942
2023-07-26 01:29:26 1.183 sec 9.0 0.3879452 0.3042796 0.1505015
--- --- --- --- --- --- ---
2023-07-26 01:29:27 2.914 sec 24.0 0.3851501 0.2994155 0.1483406
2023-07-26 01:29:27 3.029 sec 25.0 0.3850391 0.2992090 0.1482551
2023-07-26 01:29:28 3.159 sec 26.0 0.3849357 0.2990127 0.1481755
2023-07-26 01:29:28 3.282 sec 27.0 0.3848465 0.2988435 0.1481068
2023-07-26 01:29:28 3.471 sec 28.0 0.3847656 0.2986781 0.1480446
2023-07-26 01:29:28 3.610 sec 29.0 0.3846776 0.2985261 0.1479769
2023-07-26 01:29:28 3.720 sec 30.0 0.3846082 0.2983791 0.1479235
2023-07-26 01:29:28 3.837 sec 31.0 0.3845231 0.2982192 0.1478580
2023-07-26 01:29:28 3.952 sec 32.0 0.3844426 0.2980950 0.1477961
2023-07-26 01:29:32 7.151 sec 50.0 0.3833230 0.2962847 0.1469365
[34 rows x 7 columns]
Variable Importances:
variable relative_importance scaled_importance percentage
TD009_bin_WOE 746.8563232 1.0 0.2552778
TD005_WOE 400.7232666 0.5365467 0.1369684
AP003_bin_WOE 339.1337891 0.4540817 0.1159170
CR015_bin_WOE 263.8482056 0.3532784 0.0901841
TD014_bin_WOE 193.5591583 0.2591652 0.0661591
AP001_WOE 167.8301697 0.2247155 0.0573649
AP008_WOE 160.4264374 0.2148023 0.0548342
PA029_bin_WOE 153.1158142 0.2050137 0.0523355
PA023_bin_WOE 133.1398926 0.1782671 0.0455076
CR019_WOE 130.6924133 0.1749900 0.0446711
TD001_bin_WOE 79.9485245 0.1070467 0.0273267
CR009_bin_WOE 78.9457855 0.1057041 0.0269839
PA022_bin_WOE 77.4414291 0.1036899 0.0264697

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [ ]:
createGains(gbm_v2)
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[ ]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 1600 482 1118 1600 482 1118 0.15 0.09 315.0 1.53 6.0 30.12
1 1600 418 1182 3200 900 2300 0.29 0.18 630.0 1.43 11.0 28.12
2 1600 364 1236 4800 1264 3536 0.40 0.28 945.0 1.34 12.0 26.33
3 1600 360 1240 6400 1624 4776 0.52 0.37 1260.0 1.29 15.0 25.37
4 1600 305 1295 8000 1929 6071 0.61 0.47 1575.0 1.22 14.0 24.11
5 1600 265 1335 9600 2194 7406 0.70 0.58 1890.0 1.16 12.0 22.85
6 1600 256 1344 11200 2450 8750 0.78 0.68 2205.0 1.11 10.0 21.88
7 1600 269 1331 12800 2719 10081 0.86 0.78 2520.0 1.08 8.0 21.24
8 1600 243 1357 14400 2962 11438 0.94 0.89 2835.0 1.04 5.0 20.57
9 1600 188 1412 16000 3150 12850 1.00 1.00 3150.0 1.00 0.0 19.69
In [ ]:
ROC_AUC(gbm_v2,test_hex,'loan_default')
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

Section 6 Deep Learning ¶

What's Deep Learning and how does the H2O's Deep Learning work?¶

Deep Learning involves the use of artificial neural networks composed of multiple layers of neurons. Each layer processes the input from the previous layer and gradually learns to extract higher-level representations of the data. Deep learning is particularly well-suited for complex tasks like image recognition, natural language processing, and speech recognition, where traditional machine learning techniques may struggle.

H2O's Deep Learning is based on a multi-layer feedforward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier, and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout, L1 or L2 regularization, checkpointing, and grid search enable high predictive accuracy.

In [ ]:
!pip install h2o
import h2o
from h2o.estimators import H2ODeepLearningEstimator
h2o.init()
Collecting h2o
  Downloading h2o-3.42.0.2.tar.gz (249.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 249.1/249.1 MB 5.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from h2o) (2.27.1)
Requirement already satisfied: tabulate in /usr/local/lib/python3.10/dist-packages (from h2o) (0.9.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (2023.7.22)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (3.4)
Building wheels for collected packages: h2o
  Building wheel for h2o (setup.py) ... done
  Created wheel for h2o: filename=h2o-3.42.0.2-py2.py3-none-any.whl size=249153908 sha256=c9674c27a88bbe137b165755d325702177458c3c652f353a06c2f8855e00e358
  Stored in directory: /root/.cache/pip/wheels/31/f7/e0/e32942d9f76cb1cb14c949b7772eb78979d2e0132aae6c6780
Successfully built h2o
Installing collected packages: h2o
Successfully installed h2o-3.42.0.2
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.19" 2023-04-18; OpenJDK Runtime Environment (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1); OpenJDK 64-Bit Server VM (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.10/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpieh9mrkw
  JVM stdout: /tmp/tmpieh9mrkw/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpieh9mrkw/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime: 03 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.42.0.2
H2O_cluster_version_age: 1 day
H2O_cluster_name: H2O_from_python_unknownUser_k99jd3
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.170 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2
H2O_cluster_status: locked, healthy
H2O_connection_url: http://127.0.0.1:54321
H2O_connection_proxy: {"http": null, "https": null, "colab_language_server": "/usr/colab/bin/language_service"}
H2O_internal_security: False
Python_version: 3.10.6 final
In [ ]:
train_df_dl = train_df_rf
test_df_dl = test_df_rf
In [ ]:
#Use all the features first for testing
target = 'loan_default'
predictors = train_df_dl.columns.tolist()
predictors=predictors[2:17]
predictors
Out[ ]:
['AP001_WOE',
 'AP003_bin_WOE',
 'AP008_WOE',
 'CR009_bin_WOE',
 'CR015_bin_WOE',
 'CR019_WOE',
 'PA022_bin_WOE',
 'PA023_bin_WOE',
 'PA029_bin_WOE',
 'TD001_bin_WOE',
 'TD005_WOE',
 'TD006_bin_WOE',
 'TD009_bin_WOE',
 'TD010_bin_WOE',
 'TD014_bin_WOE']
In [ ]:
#Use 50% training data
train_smpl = train_df_dl.sample(frac=0.5, random_state=1)
test_smpl = test_df_dl.sample(frac=0.5, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
In [ ]:
# Build and train the model:
dl_v1 = H2ODeepLearningEstimator(distribution="tweedie",
                               hidden=[1],
                               epochs=1000,
                               train_samples_per_iteration=-1,
                               reproducible=True,
                               activation="Tanh",
                               single_node_mode=False,
                               balance_classes=False,
                               force_load_balance=False,
                               seed=23123,
                               tweedie_power=1.5,
                               score_training_samples=0,
                               score_validation_samples=0,
                               stopping_rounds=0)
dl_v1.train(x=predictors,
          y=target,
          training_frame=train_hex)
deeplearning Model Build progress: |█████████████████████████████████████████████| (done) 100%
Out[ ]:
Model Details
=============
H2ODeepLearningEstimator : Deep Learning
Model Key: DeepLearning_model_python_1690391326174_1
Status of Neuron Layers: predicting loan_default, regression, tweedie distribution, Automatic loss, 18 weights/biases, 5.2 KB, 32,000,000 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms momentum mean_weight weight_rms mean_bias bias_rms
1 15 Input 0.0
2 1 Tanh 0.0 0.0 0.0 0.0004968 0.0001334 0.0 0.1476793 0.1542717 -0.0440504 0.0000000
3 1 Linear 0.0 0.0 0.0004108 0.0000000 0.0 0.6655666 0.0000000 -1.7210161 0.0000000
ModelMetricsRegression: deeplearning
** Reported on train data. **

MSE: 0.14966607913911537
RMSE: 0.3868670044590458
MAE: 0.2986335369918558
RMSLE: 0.27133556429679573
Mean Residual Deviance: 1.8961575096973406
Scoring History:
timestamp duration training_speed epochs iterations samples training_rmse training_deviance training_mae training_r2
2023-07-26 17:46:55 0.000 sec None 0.0 0 0.0 nan nan nan nan
2023-07-26 17:46:55 1.218 sec 72562 obs/sec 1.0 1 32000.0 0.3889035 1.9117893 0.2986558 0.0307148
2023-07-26 17:46:56 1.646 sec 92485 obs/sec 2.0 2 64000.0 0.3879778 1.9027225 0.3033854 0.0353233
2023-07-26 17:46:56 1.923 sec 121518 obs/sec 3.0 3 96000.0 0.3872472 1.8996910 0.2998927 0.0389530
2023-07-26 17:46:57 2.358 sec 128256 obs/sec 4.0 4 128000.0 0.3870386 1.8971095 0.3009621 0.0399884
2023-07-26 17:46:57 2.843 sec 127490 obs/sec 5.0 5 160000.0 0.3872056 1.8989280 0.2953510 0.0391596
2023-07-26 17:46:58 3.372 sec 125162 obs/sec 6.0 6 192000.0 0.3868670 1.8961575 0.2986335 0.0408393
2023-07-26 17:46:58 3.821 sec 122270 obs/sec 7.0 7 224000.0 0.3870500 1.8974509 0.2988586 0.0399319
2023-07-26 17:46:58 4.219 sec 126046 obs/sec 8.0 8 256000.0 0.3872289 1.8987255 0.2944169 0.0390442
2023-07-26 17:46:59 4.632 sec 127886 obs/sec 9.0 9 288000.0 0.3871469 1.8980029 0.3008467 0.0394510
--- --- --- --- --- --- --- --- --- --- ---
2023-07-26 17:48:48 1 min 53.331 sec 498007 obs/sec 992.0 992 31744000.0000000 0.3888569 1.9172578 0.2952023 0.0309469
2023-07-26 17:48:48 1 min 53.423 sec 498056 obs/sec 993.0 993 31776000.0000000 0.3887163 1.9155428 0.2993813 0.0316473
2023-07-26 17:48:48 1 min 53.516 sec 498159 obs/sec 994.0 994 31808000.0000000 0.3886963 1.9153043 0.3025350 0.0317471
2023-07-26 17:48:48 1 min 53.609 sec 498247 obs/sec 995.0 995 31840000.0000000 0.3887182 1.9155005 0.2999262 0.0316382
2023-07-26 17:48:48 1 min 53.694 sec 498358 obs/sec 996.0 996 31872000.0000000 0.3887248 1.9154189 0.3029753 0.0316053
2023-07-26 17:48:48 1 min 53.781 sec 498453 obs/sec 997.0 997 31904000.0000000 0.3886923 1.9152971 0.3027701 0.0317669
2023-07-26 17:48:48 1 min 53.869 sec 498571 obs/sec 998.0 998 31936000.0000000 0.3887642 1.9158048 0.2997823 0.0314090
2023-07-26 17:48:48 1 min 53.960 sec 498635 obs/sec 999.0 999 31968000.0000000 0.3887361 1.9159206 0.2986089 0.0315489
2023-07-26 17:48:48 1 min 54.047 sec 498745 obs/sec 1000.0 1000 32000000.0000000 0.3887176 1.9155962 0.3005151 0.0316409
2023-07-26 17:48:48 1 min 54.104 sec 498597 obs/sec 1000.0 1000 32000000.0000000 0.3868670 1.8961575 0.2986335 0.0408393
[1002 rows x 11 columns]
Variable Importances:
variable relative_importance scaled_importance percentage
AP003_bin_WOE 1.0 1.0 0.2347512
CR015_bin_WOE 0.6420705 0.6420705 0.1507269
TD009_bin_WOE 0.5127221 0.5127221 0.1203622
TD005_WOE 0.4553512 0.4553512 0.1068943
TD014_bin_WOE 0.4146544 0.4146544 0.0973406
AP008_WOE 0.2762516 0.2762516 0.0648504
PA023_bin_WOE 0.2164090 0.2164090 0.0508023
PA022_bin_WOE 0.1987160 0.1987160 0.0466488
TD001_bin_WOE 0.1870835 0.1870835 0.0439181
PA029_bin_WOE 0.1362096 0.1362096 0.0319754
TD006_bin_WOE 0.0837359 0.0837359 0.0196571
CR019_WOE 0.0550442 0.0550442 0.0129217
CR009_bin_WOE 0.0322155 0.0322155 0.0075626
TD010_bin_WOE 0.0253704 0.0253704 0.0059557
AP001_WOE 0.0239944 0.0239944 0.0056327

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [ ]:
VarImp(dl_v1)
In [ ]:
createGains(dl_v1)
deeplearning prediction progress: |██████████████████████████████████████████████| (done) 100%
Out[ ]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 800 228 572 800 228 572 0.15 0.09 151.2 1.51 6.0 28.50
1 800 204 596 1600 432 1168 0.29 0.18 302.4 1.43 11.0 27.00
2 800 193 607 2400 625 1775 0.41 0.27 453.6 1.38 14.0 26.04
3 800 178 622 3200 803 2397 0.53 0.37 604.8 1.33 16.0 25.09
4 800 145 655 4000 948 3052 0.63 0.47 756.0 1.25 16.0 23.70
5 800 133 667 4800 1081 3719 0.71 0.57 907.2 1.19 14.0 22.52
6 800 113 687 5600 1194 4406 0.79 0.68 1058.4 1.13 11.0 21.32
7 800 124 676 6400 1318 5082 0.87 0.78 1209.6 1.09 9.0 20.59
8 800 113 687 7200 1431 5769 0.95 0.89 1360.8 1.05 6.0 19.88
9 800 81 719 8000 1512 6488 1.00 1.00 1512.0 1.00 0.0 18.90
In [ ]:
ROC_AUC(dl_v1,test_hex,'loan_default')
deeplearning prediction progress: |██████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

Hyperparameter Tuning¶

Neural networks often benefit from deeper architectures with more neurons in each layer to learn intricate patterns and representations from the data. 'hidden' specifies the hidden layer sizes, let's set the "hidden" parameter to 15, which is the number of features.¶

In [ ]:
#Use 50% training data
train_smpl = train_df_dl.sample(frac=0.5, random_state=1)
test_smpl = test_df_dl.sample(frac=0.5, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
In [ ]:
# Build and train the model:
dl_v2 = H2ODeepLearningEstimator(distribution="tweedie",
                               hidden=[15],
                               epochs=1000,
                               train_samples_per_iteration=-1,
                               reproducible=True,
                               activation="Tanh",
                               single_node_mode=False,
                               balance_classes=False,
                               force_load_balance=False,
                               seed=23123,
                               tweedie_power=1.5,
                               score_training_samples=0,
                               score_validation_samples=0,
                               stopping_rounds=0)
dl_v2.train(x=predictors,
          y=target,
          training_frame=train_hex)
deeplearning Model Build progress: |█████████████████████████████████████████████| (done) 100%
Out[ ]:
Model Details
=============
H2ODeepLearningEstimator : Deep Learning
Model Key: DeepLearning_model_python_1690391326174_7
Status of Neuron Layers: predicting loan_default, regression, tweedie distribution, Automatic loss, 256 weights/biases, 8.2 KB, 32,000,000 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms momentum mean_weight weight_rms mean_bias bias_rms
1 15 Input 0.0
2 15 Tanh 0.0 0.0 0.0 0.0578403 0.1227143 0.0 -0.0593437 0.4779193 -0.5849661 1.6329694
3 1 Linear 0.0 0.0 0.0004323 0.0001015 0.0 0.0434363 0.1713431 -0.5021105 0.0000000
ModelMetricsRegression: deeplearning
** Reported on train data. **

MSE: 0.14942329051863198
RMSE: 0.38655308887477796
MAE: 0.3020197073141356
RMSLE: 0.27189718953532904
Mean Residual Deviance: 1.8937793450713964
Scoring History:
timestamp duration training_speed epochs iterations samples training_rmse training_deviance training_mae training_r2
2023-07-26 18:53:52 0.000 sec None 0.0 0 0.0 nan nan nan nan
2023-07-26 18:53:54 1.970 sec 19138 obs/sec 1.0 1 32000.0 0.3881327 1.9006293 0.2980732 0.0345530
2023-07-26 18:53:55 3.196 sec 24233 obs/sec 2.0 2 64000.0 0.3873059 1.8969044 0.2986840 0.0386616
2023-07-26 18:53:56 4.086 sec 28012 obs/sec 3.0 3 96000.0 0.3870877 1.8969663 0.3064743 0.0397447
2023-07-26 18:53:57 4.605 sec 32947 obs/sec 4.0 4 128000.0 0.3867414 1.8941267 0.2975281 0.0414620
2023-07-26 18:53:57 5.045 sec 37488 obs/sec 5.0 5 160000.0 0.3872732 1.8983977 0.2946867 0.0388241
2023-07-26 18:53:58 5.475 sec 41406 obs/sec 6.0 6 192000.0 0.3868850 1.8966583 0.2902008 0.0407501
2023-07-26 18:53:58 5.901 sec 44728 obs/sec 7.0 7 224000.0 0.3871962 1.8970791 0.2995006 0.0392064
2023-07-26 18:53:58 6.330 sec 47583 obs/sec 8.0 8 256000.0 0.3867782 1.8951614 0.2967038 0.0412796
2023-07-26 18:53:59 6.753 sec 50130 obs/sec 9.0 9 288000.0 0.3868773 1.8952747 0.2960012 0.0407884
--- --- --- --- --- --- --- --- --- --- ---
2023-07-26 19:00:37 6 min 45.268 sec 93817 obs/sec 992.0 992 31744000.0000000 0.3870540 1.8995680 0.2935097 0.0399120
2023-07-26 19:00:38 6 min 45.611 sec 93832 obs/sec 993.0 993 31776000.0000000 0.3874609 1.8991271 0.2975654 0.0378924
2023-07-26 19:00:38 6 min 45.956 sec 93846 obs/sec 994.0 994 31808000.0000000 0.3870855 1.8998078 0.2924485 0.0397554
2023-07-26 19:00:38 6 min 46.309 sec 93858 obs/sec 995.0 995 31840000.0000000 0.3874081 1.9044613 0.3107757 0.0381545
2023-07-26 19:00:39 6 min 46.657 sec 93873 obs/sec 996.0 996 31872000.0000000 0.3870366 1.9013690 0.3041471 0.0399982
2023-07-26 19:00:39 6 min 47.002 sec 93888 obs/sec 997.0 997 31904000.0000000 0.3872504 1.9025377 0.3066765 0.0389372
2023-07-26 19:00:39 6 min 47.354 sec 93901 obs/sec 998.0 998 31936000.0000000 0.3880618 1.9069717 0.3143574 0.0349056
2023-07-26 19:00:40 6 min 47.711 sec 93915 obs/sec 999.0 999 31968000.0000000 0.3872698 1.8996847 0.3028403 0.0388408
2023-07-26 19:00:40 6 min 48.056 sec 93929 obs/sec 1000.0 1000 32000000.0000000 0.3879566 1.9117697 0.2843278 0.0354289
2023-07-26 19:00:40 6 min 48.130 sec 93924 obs/sec 1000.0 1000 32000000.0000000 0.3865531 1.8937793 0.3020197 0.0423953
[1002 rows x 11 columns]
Variable Importances:
variable relative_importance scaled_importance percentage
AP003_bin_WOE 1.0 1.0 0.1622127
TD014_bin_WOE 0.6524100 0.6524100 0.1058292
CR015_bin_WOE 0.5777300 0.5777300 0.0937151
TD006_bin_WOE 0.5119238 0.5119238 0.0830405
TD009_bin_WOE 0.4652037 0.4652037 0.0754619
TD010_bin_WOE 0.4546128 0.4546128 0.0737439
TD005_WOE 0.3899076 0.3899076 0.0632480
CR019_WOE 0.3342157 0.3342157 0.0542140
PA029_bin_WOE 0.3297463 0.3297463 0.0534890
CR009_bin_WOE 0.3015221 0.3015221 0.0489107
TD001_bin_WOE 0.2873588 0.2873588 0.0466132
PA022_bin_WOE 0.2786159 0.2786159 0.0451950
PA023_bin_WOE 0.2548023 0.2548023 0.0413322
AP008_WOE 0.1798768 0.1798768 0.0291783
AP001_WOE 0.1468209 0.1468209 0.0238162

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [ ]:
ROC_AUC(dl_v2,test_hex,'loan_default')
deeplearning prediction progress: |██████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

Try Dropping Two Less Important Features¶

Turns out dropping less important features doesn't lead to better performance with other factors held constant¶

In [ ]:
#Use all the features first for testing
target = 'loan_default'
predictors = train_df_dl.columns.tolist()
predictors=predictors[2:17]
values_to_remove = ['AP001_WOE', 'TD010_bin_WOE','CR009_bin_WOE']
predictors = [item for item in predictors if item not in values_to_remove]
predictors
Out[ ]:
['AP003_bin_WOE',
 'AP008_WOE',
 'CR015_bin_WOE',
 'CR019_WOE',
 'PA022_bin_WOE',
 'PA023_bin_WOE',
 'PA029_bin_WOE',
 'TD001_bin_WOE',
 'TD005_WOE',
 'TD006_bin_WOE',
 'TD009_bin_WOE',
 'TD014_bin_WOE']
In [ ]:
#Use 50% training data and all test data
train_smpl = train_df_dl.sample(frac=0.5, random_state=1)
test_smpl = test_df_dl.sample(frac=0.5, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
In [ ]:
# Build and train the model:
dl_v3 = H2ODeepLearningEstimator(distribution="tweedie",
                               hidden=[15],
                               epochs=1000,
                               train_samples_per_iteration=-1,
                               reproducible=True,
                               activation="Tanh",
                               single_node_mode=False,
                               balance_classes=False,
                               force_load_balance=False,
                               seed=23123,
                               tweedie_power=1.5,
                               score_training_samples=0,
                               score_validation_samples=0,
                               stopping_rounds=0)
dl_v3.train(x=predictors,
          y=target,
          training_frame=train_hex)
deeplearning Model Build progress: |█████████████████████████████████████████████| (done) 100%
Out[ ]:
Model Details
=============
H2ODeepLearningEstimator : Deep Learning
Model Key: DeepLearning_model_python_1690391326174_8
Status of Neuron Layers: predicting loan_default, regression, tweedie distribution, Automatic loss, 211 weights/biases, 7.2 KB, 32,000,000 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms momentum mean_weight weight_rms mean_bias bias_rms
1 12 Input 0.0
2 15 Tanh 0.0 0.0 0.0 0.0351888 0.0610454 0.0 -0.0618614 0.3207351 0.1599673 1.3066607
3 1 Linear 0.0 0.0 0.0004093 0.0000196 0.0 -0.0319876 0.1622897 -0.8569771 0.0000000
ModelMetricsRegression: deeplearning
** Reported on train data. **

MSE: 0.1495221752667646
RMSE: 0.38668097349981495
MAE: 0.3011394386382141
RMSLE: 0.27164728953221895
Mean Residual Deviance: 1.8940746121314866
Scoring History:
timestamp duration training_speed epochs iterations samples training_rmse training_deviance training_mae training_r2
2023-07-26 19:04:44 0.000 sec None 0.0 0 0.0 nan nan nan nan
2023-07-26 19:04:45 1.971 sec 19161 obs/sec 1.0 1 32000.0 0.3879453 1.8996317 0.2972218 0.0354848
2023-07-26 19:04:47 3.345 sec 23503 obs/sec 2.0 2 64000.0 0.3876217 1.8976999 0.2994473 0.0370933
2023-07-26 19:04:48 4.212 sec 28152 obs/sec 3.0 3 96000.0 0.3871285 1.8966283 0.3075731 0.0395422
2023-07-26 19:04:49 5.255 sec 29540 obs/sec 4.0 4 128000.0 0.3868933 1.8949465 0.2987211 0.0407091
2023-07-26 19:04:49 5.999 sec 32566 obs/sec 5.0 5 160000.0 0.3870850 1.8962858 0.2952595 0.0397579
2023-07-26 19:04:50 6.946 sec 33790 obs/sec 6.0 6 192000.0 0.3869928 1.8979457 0.2895132 0.0402157
2023-07-26 19:04:52 8.231 sec 32719 obs/sec 7.0 7 224000.0 0.3872329 1.8974479 0.2994648 0.0390243
2023-07-26 19:04:53 9.759 sec 31569 obs/sec 8.0 8 256000.0 0.3868305 1.8955196 0.2969099 0.0410204
2023-07-26 19:04:54 11.046 sec 31355 obs/sec 9.0 9 288000.0 0.3871136 1.8969040 0.2979204 0.0396161
--- --- --- --- --- --- --- --- --- --- ---
2023-07-26 19:10:56 6 min 12.565 sec 102812 obs/sec 992.0 992 31744000.0000000 0.3871540 1.9004186 0.2946634 0.0394155
2023-07-26 19:10:56 6 min 12.875 sec 102829 obs/sec 993.0 993 31776000.0000000 0.3884846 1.9017094 0.2992910 0.0328013
2023-07-26 19:10:57 6 min 13.189 sec 102845 obs/sec 994.0 994 31808000.0000000 0.3873462 1.9022251 0.2928724 0.0384615
2023-07-26 19:10:57 6 min 13.514 sec 102858 obs/sec 995.0 995 31840000.0000000 0.3877552 1.9052790 0.3105979 0.0364300
2023-07-26 19:10:57 6 min 13.829 sec 102875 obs/sec 996.0 996 31872000.0000000 0.3871145 1.9019268 0.3053463 0.0396115
2023-07-26 19:10:58 6 min 14.147 sec 102890 obs/sec 997.0 997 31904000.0000000 0.3874727 1.9045983 0.3077272 0.0378336
2023-07-26 19:10:58 6 min 14.462 sec 102906 obs/sec 998.0 998 31936000.0000000 0.3881551 1.9090649 0.3144513 0.0344417
2023-07-26 19:10:58 6 min 14.878 sec 102905 obs/sec 999.0 999 31968000.0000000 0.3875725 1.9004903 0.3021549 0.0373376
2023-07-26 19:10:59 6 min 15.360 sec 102884 obs/sec 1000.0 1000 32000000.0000000 0.3882399 1.9135242 0.2852468 0.0340197
2023-07-26 19:10:59 6 min 15.496 sec 102873 obs/sec 1000.0 1000 32000000.0000000 0.3866810 1.8940746 0.3011394 0.0417616
[1002 rows x 11 columns]
Variable Importances:
variable relative_importance scaled_importance percentage
AP003_bin_WOE 1.0 1.0 0.1875495
TD005_WOE 0.7480226 0.7480226 0.1402912
CR015_bin_WOE 0.5754523 0.5754523 0.1079258
TD009_bin_WOE 0.4181940 0.4181940 0.0784321
TD014_bin_WOE 0.4101961 0.4101961 0.0769321
PA029_bin_WOE 0.3894845 0.3894845 0.0730476
CR019_WOE 0.3676064 0.3676064 0.0689444
PA023_bin_WOE 0.3661864 0.3661864 0.0686781
TD001_bin_WOE 0.3494761 0.3494761 0.0655441
PA022_bin_WOE 0.2866926 0.2866926 0.0537690
TD006_bin_WOE 0.2157678 0.2157678 0.0404671
AP008_WOE 0.2048479 0.2048479 0.0384191

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [ ]:
ROC_AUC(dl_v3,test_hex,'loan_default')
deeplearning prediction progress: |██████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

Insights:¶

  • For models built in Section 5 GBM and Section 6 Deep Learning, the deep learning model 'dl_v2' performs the best.

We get AUC of 60.66% from ROC curve and the precision-recall rate of 25.34%.

  • Using same sample sizes and features, deep learning model performs better than GBM model overall, indicating better prediction power for this data.
  • 'dl_v2' model has hyperparameter 'hidden' set to 15, 'balance_classes' set to False, use 50% of the train and test data and all 15 WOE transformed features.

Next Step to Improve the Model:¶

  • Increase Model Complexity: Consider adding more hidden layers and neurons to the deep learning model to capture complex patterns in the data.
  • Hyperparameter Tuning: Optimize hyperparameters such as learning rate, activation functions, and regularization to find the best configuration for the model's architecture. Use techniques like cross-validation to prevent overfitting and enhance generalization.

Section 7 GLM ¶

What's GLM?¶

Generalized Linear Model (GLM) is a versatile statistical framework for analyzing data and building predictive models. It extends traditional linear regression to handle a wider range of data distributions, making it suitable for various types of data, including binary, count, and continuous outcomes. GLM incorporates a link function to connect the linear predictor to the response variable's distribution, allowing for flexible modeling. It's used for tasks like regression, classification, and more, offering interpretability and adaptability. Regularization techniques like Ridge and LASSO can be applied to control model complexity.

In [163]:
!pip install h2o
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()
Requirement already satisfied: h2o in /usr/local/lib/python3.10/dist-packages (3.42.0.2)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from h2o) (2.31.0)
Requirement already satisfied: tabulate in /usr/local/lib/python3.10/dist-packages (from h2o) (0.9.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (2023.7.22)
Checking whether there is an H2O instance running at http://localhost:54321. connected.
H2O_cluster_uptime: 8 mins 37 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.42.0.2
H2O_cluster_version_age: 16 days
H2O_cluster_name: H2O_from_python_unknownUser_dsx5kv
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.170 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null, "colab_language_server": "/usr/colab/bin/language_service"}
H2O_internal_security: False
Python_version: 3.10.12 final
In [164]:
train_df_glm = train_df_rf
test_df_glm = test_df_rf
In [165]:
#Use all the features first for testing
target = 'loan_default'
predictors = train_df_glm.columns.tolist()
predictors=predictors[2:17]
predictors
Out[165]:
['AP001_WOE',
 'AP003_bin_WOE',
 'AP008_WOE',
 'CR009_bin_WOE',
 'CR015_bin_WOE',
 'CR019_WOE',
 'PA022_bin_WOE',
 'PA023_bin_WOE',
 'PA029_bin_WOE',
 'TD001_bin_WOE',
 'TD005_WOE',
 'TD006_bin_WOE',
 'TD009_bin_WOE',
 'TD010_bin_WOE',
 'TD014_bin_WOE']
In [166]:
#Use 50% training data
train_smpl = train_df_glm.sample(frac=0.5, random_state=1)
test_smpl = test_df_glm.sample(frac=0.5, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
In [167]:
glm_v1 = H2OGeneralizedLinearEstimator(family= "binomial", lambda_ = 0.05) #, compute_p_values = True)
glm_v1.train(predictors,target,training_frame=train_hex)
glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Out[167]:
Model Details
=============
H2OGeneralizedLinearEstimator : Generalized Linear Modeling
Model Key: GLM_model_python_1691688810087_1
GLM Model: summary
family link regularization number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
binomial logit Elastic Net (alpha = 0.5, lambda = 0.05 ) 15 8 4 Key_Frame__upload_b373c209abd41bd81c47e5a061084af2.hex
ModelMetricsBinomialGLM: glm
** Reported on train data. **

MSE: 0.1526465109366919
RMSE: 0.3907000267938203
LogLoss: 0.4805152258712424
AUC: 0.6264445896951362
AUCPR: 0.2790494840657378
Gini: 0.25288917939027233
Null degrees of freedom: 31999
Residual degrees of freedom: 31991
Null deviance: 31437.68171069583
Residual deviance: 30752.974455759515
AIC: 30770.974455759515
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.18583615474294712
0 1 Error Rate
0 12789.0 13020.0 0.5045 (13020.0/25809.0)
1 1976.0 4215.0 0.3192 (1976.0/6191.0)
Total 14765.0 17235.0 0.4686 (14996.0/32000.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
max f1 0.1858362 0.3598566 266.0
max f2 0.1531457 0.5526057 373.0
max f0point5 0.2139063 0.3089484 165.0
max accuracy 0.2811279 0.8066875 5.0
max precision 0.2844217 0.5909091 2.0
max recall 0.1366615 1.0 399.0
max specificity 0.2869710 0.9998450 0.0
max absolute_mcc 0.2056047 0.1504460 194.0
max min_per_class_accuracy 0.1937387 0.5894455 236.0
max mean_per_class_accuracy 0.1943942 0.5901554 233.0
max tns 0.2869710 25805.0 0.0
max fns 0.2869710 6187.0 0.0
max fps 0.1366615 25809.0 399.0
max tps 0.1366615 6191.0 399.0
max tnr 0.2869710 0.9998450 0.0
max fnr 0.2869710 0.9993539 0.0
max fpr 0.1366615 1.0 399.0
max tpr 0.1366615 1.0 399.0
Gains/Lift Table: Avg response rate: 19.35 %, avg score: 19.35 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate score cumulative_response_rate cumulative_score capture_rate cumulative_capture_rate gain cumulative_gain kolmogorov_smirnov
1 0.0100938 0.2647237 2.1443292 2.1443292 0.4148607 0.2714893 0.4148607 0.2714893 0.0216443 0.0216443 114.4329155 114.4329155 0.0143213
2 0.02 0.2589252 1.5000915 1.8252302 0.2902208 0.2614576 0.353125 0.2665205 0.0148603 0.0365046 50.0091463 82.5230173 0.0204637
3 0.030375 0.2561017 2.0861997 1.9143679 0.4036145 0.2572176 0.3703704 0.2633430 0.0216443 0.0581489 108.6199750 91.4367930 0.0344363
4 0.04 0.2521292 1.8795612 1.9059926 0.3636364 0.2542846 0.36875 0.2611633 0.0180908 0.0762397 87.9561240 90.5992570 0.0449328
5 0.05 0.2499286 1.7606203 1.8769181 0.340625 0.2510795 0.363125 0.2591465 0.0176062 0.0938459 76.0620255 87.6918107 0.0543636
6 0.1003125 0.2403076 1.5923736 1.7342026 0.3080745 0.2450548 0.3355140 0.2520787 0.0801163 0.1739622 59.2373622 73.4202649 0.0913166
7 0.1500313 0.2307312 1.4911855 1.6536694 0.2884978 0.2352361 0.3199333 0.2464973 0.0741399 0.2481021 49.1185528 65.3669377 0.1215958
8 0.2000937 0.2215173 1.2647734 1.5563695 0.2446941 0.2258361 0.3011089 0.2413279 0.0633177 0.3114198 26.4773419 55.6369467 0.1380307
9 0.3022813 0.2093273 1.2645366 1.4577141 0.2446483 0.2149020 0.2820221 0.2323945 0.1292198 0.4406396 26.4536614 45.7714093 0.1715475
10 0.400375 0.1986757 1.0373813 1.3547306 0.2007009 0.2036881 0.2620980 0.2253613 0.1017606 0.5424003 3.7381283 35.4730586 0.1760939
11 0.50225 0.1882932 1.0210745 1.2870527 0.1975460 0.1932596 0.2490045 0.2188499 0.1040220 0.6464222 2.1074526 28.7052714 0.1787559
12 0.6055625 0.1815325 0.9099328 1.2227139 0.1760436 0.1844916 0.2365569 0.2129882 0.0940074 0.7404297 -9.0067222 22.2713850 0.1672188
13 0.7000937 0.1729501 0.7518245 1.1591313 0.1454545 0.1771143 0.2242557 0.2081442 0.0710709 0.8115006 -24.8175504 15.9131281 0.1381308
14 0.8133125 0.1623638 0.7504238 1.1022364 0.1451835 0.1665942 0.2132483 0.2023602 0.0849620 0.8964626 -24.9576226 10.2236357 0.1030960
15 0.9146875 0.1561530 0.6580492 1.0530070 0.1273120 0.1583600 0.2037239 0.1974836 0.0667097 0.9631723 -34.1950777 5.3007007 0.0601153
16 1.0 0.1366615 0.4316794 1.0 0.0835165 0.1504080 0.1934687 0.1934675 0.0368277 1.0 -56.8320550 0.0 0.0
Scoring History:
timestamp duration iterations negative_log_likelihood objective training_rmse training_logloss training_r2 training_auc training_pr_auc training_lift training_classification_error
2023-08-10 17:43:35 0.000 sec 0 15718.8408553 0.4912138
2023-08-10 17:43:35 0.196 sec 1 15458.4752627 0.4882442
2023-08-10 17:43:35 0.223 sec 2 15458.9172757 0.4882350
2023-08-10 17:43:35 0.419 sec 3 15376.9680691 0.4879435
2023-08-10 17:43:35 0.478 sec 4 15376.4872279 0.4879434 0.3907000 0.4805152 0.0217387 0.6264446 0.2790495 2.1443292 0.468625
Variable Importances:
variable relative_importance scaled_importance percentage
TD009_bin_WOE 0.1053973 1.0 0.3665086
TD005_WOE 0.0680206 0.6453740 0.2365351
TD014_bin_WOE 0.0383866 0.3642091 0.1334858
AP003_bin_WOE 0.0334363 0.3172407 0.1162715
CR015_bin_WOE 0.0242039 0.2296448 0.0841668
PA023_bin_WOE 0.0131986 0.1252271 0.0458968
PA029_bin_WOE 0.0046987 0.0445805 0.0163392
PA022_bin_WOE 0.0002290 0.0021724 0.0007962
AP001_WOE 0.0 0.0 0.0
AP008_WOE 0.0 0.0 0.0
CR009_bin_WOE 0.0 0.0 0.0
CR019_WOE 0.0 0.0 0.0
TD001_bin_WOE 0.0 0.0 0.0
TD006_bin_WOE 0.0 0.0 0.0
TD010_bin_WOE 0.0 0.0 0.0

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [168]:
glm_v1.predict(test_hex)
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[168]:
predict p0 p1
10.7751780.224822
10.7773860.222614
10.7819840.218016
00.8310520.168948
10.7948220.205178
10.7873060.212694
10.7958950.204105
00.8448350.155165
00.8341720.165828
10.7852590.214741
[8000 rows x 3 columns]
In [169]:
glm_v1.predict(test_hex)['p1']
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[169]:
p1
0.224822
0.222614
0.218016
0.168948
0.205178
0.212694
0.204105
0.155165
0.165828
0.214741
[8000 rows x 1 column]
In [170]:
predictions = glm_v1.predict(test_hex)['p1']
test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()
test_scores.head()
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[170]:
loan_default p1
0 0 0.224822
1 0 0.222614
2 0 0.218016
3 0 0.168948
4 0 0.205178
In [171]:
def createGains(model):
    predictions = model.predict(test_hex)['p1']
    test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()

    #sort on prediction (descending), add id, and decile for groups containing 1/10 of datapoints
    test_scores = test_scores.sort_values(by='p1',ascending=False)
    test_scores['row_id'] = range(0,0+len(test_scores))
    test_scores['decile'] = ( test_scores['row_id'] / (len(test_scores)/10) ).astype(int)
    #see count by decile
    test_scores.loc[test_scores['decile'] == 10]=9
    test_scores['decile'].value_counts()

    #create gains table
    gains = test_scores.groupby('decile')['loan_default'].agg(['count','sum'])
    gains.columns = ['count','actual']
    gains

    #add features to gains table
    gains['non_actual'] = gains['count'] - gains['actual']
    gains['cum_count'] = gains['count'].cumsum()
    gains['cum_actual'] = gains['actual'].cumsum()
    gains['cum_non_actual'] = gains['non_actual'].cumsum()
    gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
    gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
    gains['if_random'] = np.max(gains['cum_actual']) /10
    gains['if_random'] = gains['if_random'].cumsum()
    gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
    gains['K_S'] = np.abs( gains['percent_cum_actual'] -  gains['percent_cum_non_actual'] ) * 100
    gains['gain']=(gains['cum_actual']/gains['cum_count']*100).round(2)
    gains = pd.DataFrame(gains)
    return(gains)

createGains(glm_v1)
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[171]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 800 201 599 800 201 599 0.13 0.09 151.2 1.33 4.0 25.12
1 800 217 583 1600 418 1182 0.28 0.18 302.4 1.38 10.0 26.12
2 800 195 605 2400 613 1787 0.41 0.28 453.6 1.35 13.0 25.54
3 800 171 629 3200 784 2416 0.52 0.37 604.8 1.30 15.0 24.50
4 800 161 639 4000 945 3055 0.62 0.47 756.0 1.25 15.0 23.62
5 800 127 673 4800 1072 3728 0.71 0.57 907.2 1.18 14.0 22.33
6 800 125 675 5600 1197 4403 0.79 0.68 1058.4 1.13 11.0 21.38
7 800 107 693 6400 1304 5096 0.86 0.79 1209.6 1.08 7.0 20.38
8 800 117 683 7200 1421 5779 0.94 0.89 1360.8 1.04 5.0 19.74
9 800 91 709 8000 1512 6488 1.00 1.00 1512.0 1.00 0.0 18.90
In [172]:
def ROC_AUC(my_result,df,target):
    from sklearn.metrics import roc_curve,auc
    from sklearn.metrics import average_precision_score
    from sklearn.metrics import precision_recall_curve
    import matplotlib.pyplot as plt

    # ROC
    y_actual = df[target].as_data_frame()
    y_pred = my_result.predict(df)['p1'].as_data_frame()
    fpr = list()
    tpr = list()
    roc_auc = list()
    fpr,tpr,_ = roc_curve(y_actual,y_pred)
    roc_auc = auc(fpr,tpr)

    # Precision-Recall
    average_precision = average_precision_score(y_actual,y_pred)

    print('')
    print('   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate')
    print('')
    print('	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy')
    print('')
    print('   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)')
    print('')

    # plotting
    plt.figure(figsize=(10,4))

    # ROC
    plt.subplot(1,2,1)
    plt.plot(fpr,tpr,color='darkorange',lw=2,label='ROC curve (aare=%0.2f)' % roc_auc)
    plt.plot([0,1],[0,1],color='navy',lw=3,linestyle='--')
    plt.xlim([0.0,1.0])
    plt.ylim([0.0,1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic: AUC={0:0.4f}'.format(roc_auc))
    plt.legend(loc='lower right')

    # Precision-Recall
    plt.subplot(1,2,2)
    precision,recall,_ = precision_recall_curve(y_actual,y_pred)
    plt.step(recall,precision,color='b',alpha=0.2,where='post')
    plt.fill_between(recall,precision,step='post',alpha=0.2,color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0,1.05])
    plt.xlim([0.0,1.0])
    plt.title('Precision-Recall curve: PR={0:0.4f}'.format(average_precision))
    plt.show()
In [173]:
ROC_AUC(glm_v1,test_hex,'loan_default')
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

In [174]:
# Print the Coefficients table
coefs = glm_v1._model_json['output']['coefficients_table'].as_data_frame()
coefs = pd.DataFrame(coefs)
coefs.sort_values(by='standardized_coefficients',ascending=False)
Out[174]:
names coefficients standardized_coefficients
13 TD009_bin_WOE 0.305424 0.105397
11 TD005_WOE 0.210866 0.068021
15 TD014_bin_WOE 0.136815 0.038387
2 AP003_bin_WOE 0.169291 0.033436
5 CR015_bin_WOE 0.128310 0.024204
8 PA023_bin_WOE 0.069992 0.013199
9 PA029_bin_WOE 0.023497 0.004699
7 PA022_bin_WOE 0.001180 0.000229
1 AP001_WOE 0.000000 0.000000
3 AP008_WOE 0.000000 0.000000
4 CR009_bin_WOE 0.000000 0.000000
6 CR019_WOE 0.000000 0.000000
10 TD001_bin_WOE 0.000000 0.000000
12 TD006_bin_WOE 0.000000 0.000000
14 TD010_bin_WOE 0.000000 0.000000
0 Intercept -1.413201 -1.439362

Hyperparameter Tuning¶

To get the best possible model, GLM needs to find the optimal values of the regularization parameters 𝛼 and 𝜆. When performing regularization, penalties are introduced to the model buidling process to avoid overfitting, to reduce variance of the prediction error, and to handle correlated predictors.

Lambda (λ) is a regularization parameter that controls the extent of regularization in models like Ridge Regression, LASSO, and Elastic Net. When λ is 0, no regularization occurs, risking overfitting. Alpha (α) adjusts the balance between LASSO and Ridge penalties in Elastic Net; α=0 implies Ridge, α=1 means LASSO, and 0<α<1 blends both. Ridge (λ>0, α=0) minimizes coefficients with L2 penalty, LASSO (λ>0, α=1) enforces sparsity via L1 penalty, while Elastic Net (λ>0, 0<α<1) combines both penalties, offering flexibility in feature selection and coefficient control.

We'll perform grid search to find the best value for the regularization parameter lambda (λ) in a GLM.

In [177]:
train, valid= train_hex.split_frame(ratios = [.8])
In [178]:
# Example of values to grid over for `lambda`
# import Grid Search
from h2o.grid.grid_search import H2OGridSearch

# select the values for lambda_ to grid over
hyper_params = {'lambda': [1, 0.5, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0]}

# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space use
# random grid search instead: {'strategy': "RandomDiscrete"}
# initialize the glm estimator
glm_v2 = H2OGeneralizedLinearEstimator(family = 'binomial')

# build grid search with previously made GLM and hyperparameters
grid = H2OGridSearch(model = glm_v2, hyper_params = hyper_params,
                     search_criteria = {'strategy': "Cartesian"})

# train using the grid
grid.train(x = predictors, y = target, training_frame = train, validation_frame = valid)
glm Grid Build progress: |███████████████████████████████████████████████████████| (done) 100%
Out[178]:
Hyper-Parameter Search Summary: ordered by increasing logloss
lambda model_ids logloss
0.001 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_5 0.4729397
0.0001 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_6 0.4731181
1e-05 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_7 0.4731423
0.0 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_8 0.4731491
0.01 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_4 0.4732497
0.1 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_3 0.4911780
1.0 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_1 0.4929956
0.5 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_2 0.4929956
In [179]:
# sort the grid models by decreasing AUC
sorted_grid = grid.get_grid(sort_by = 'auc', decreasing = True)
print(sorted_grid)
Hyper-Parameter Search Summary: ordered by decreasing auc
    lambda    model_ids                                                    auc
--  --------  -----------------------------------------------------------  --------
    0.001     Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_5  0.641245
    0.01      Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_4  0.641218
    0.0001    Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_6  0.640806
    1e-05     Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_7  0.64072
    0         Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_8  0.640669
    0.1       Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_3  0.594032
    1         Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_1  0.5
    0.5       Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_2  0.5

After conducting a grid search over different lambda values to find the best regularization strength for the GLM model in terms of binary classification performance, the results are sorted by AUC to evaluate binary classification models' predictive accuracy. Using 50% of the train data, the hyper-parameter search evaluated models with different lambda values. Lambda values of 0.001, 0.01, and 0.0001 had the highest AUCs, indicating better model performance. Lower lambda values led to better results, while higher values and extremes like 0.1, 1, and 0.5 resulted in less effective models.

In [181]:
glm_v2 = H2OGeneralizedLinearEstimator(family= "binomial", lambda_ = 0.001) #, compute_p_values = True)
glm_v2.train(predictors,target,training_frame=train_hex)
glm_v2.predict(test_hex)
glm_v2.predict(test_hex)['p1']
glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[181]:
p1
0.357331
0.294859
0.253227
0.16267
0.221681
0.247293
0.177558
0.0841446
0.108281
0.277609
[8000 rows x 1 column]
In [182]:
predictions = glm_v2.predict(test_hex)['p1']
test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()
test_scores.head()
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[182]:
loan_default p1
0 0 0.357331
1 0 0.294859
2 0 0.253227
3 0 0.162670
4 0 0.221681
In [184]:
ROC_AUC(glm_v2,test_hex,'loan_default')
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

Insights:¶

  • After fine-tuning the model's hyperparameters and optimizing the lambda value to 0.001 in "glm_v2", a significant improvement in the area under the ROC curve (AUC) was achieved. The AUC increased from 0.5949 to 0.628, indicating an enhancement in the model's ability to distinguish between positive and negative instances in the binary classification task. This adjustment underscores the importance of selecting appropriate hyperparameters to maximize the model's predictive accuracy and overall performance.

Section 8 AutoML ¶

What's autoML?¶

AutoML, short for Automated Machine Learning, is a powerful tool that automates the process of building and optimizing machine learning models. It streamlines and accelerates the complex steps involved in model selection, feature engineering, hyperparameter tuning, and ensemble building. AutoML algorithms search through a variety of model architectures, preprocessing techniques, and hyperparameter configurations to find the best-performing model for a given task. It reduces the need for manual trial-and-error, making machine learning accessible to a wider range of users, including those without extensive data science expertise. AutoML helps in faster model development, improved model accuracy, and increased efficiency in deploying machine learning solutions across different domains and applications.

In [187]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321. connected.
H2O_cluster_uptime: 1 hour 46 mins
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.42.0.2
H2O_cluster_version_age: 16 days
H2O_cluster_name: H2O_from_python_unknownUser_dsx5kv
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.166 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null, "colab_language_server": "/usr/colab/bin/language_service"}
H2O_internal_security: False
Python_version: 3.10.12 final
In [188]:
train_df_auto = train_df_rf
test_df_auto = test_df_rf
In [189]:
#Use all the features first for testing
target = 'loan_default'
predictors = train_df_auto.columns.tolist()
predictors=predictors[2:17]
predictors
Out[189]:
['AP001_WOE',
 'AP003_bin_WOE',
 'AP008_WOE',
 'CR009_bin_WOE',
 'CR015_bin_WOE',
 'CR019_WOE',
 'PA022_bin_WOE',
 'PA023_bin_WOE',
 'PA029_bin_WOE',
 'TD001_bin_WOE',
 'TD005_WOE',
 'TD006_bin_WOE',
 'TD009_bin_WOE',
 'TD010_bin_WOE',
 'TD014_bin_WOE']
In [190]:
#Use 50% training data
train_smpl = train_df_auto.sample(frac=0.5, random_state=1)
test_smpl = test_df_auto.sample(frac=0.5, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%

Run AutoML¶

Run AutoML, stopping after 60 seconds. The max_runtime_secs argument provides a way to limit the AutoML run by time. When using a time-limited stopping criterion, the number of models train will vary between runs. If different hardware is used or even if the same machine is used but the available compute resources on that machine are not the same between runs, then AutoML may be able to train more models on one run vs another.

The test frame is passed explicitly to the leaderboard_frame argument here, which means that instead of using cross-validated metrics, we use test set metrics for generating the leaderboard.

In [191]:
# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml_v1 = H2OAutoML(max_runtime_secs = 60, max_models=20, seed=1)
aml_v1.train(predictors,target,training_frame=train_hex)
AutoML progress: |
19:21:17.952: _response param, We have detected that your response column has only 2 unique values (0/1). If you wish to train a binary model instead of a regression model, convert your target column to categorical before training.

████████████
19:21:29.877: _response param, We have detected that your response column has only 2 unique values (0/1). If you wish to train a binary model instead of a regression model, convert your target column to categorical before training.
19:21:31.109: _response param, We have detected that your response column has only 2 unique values (0/1). If you wish to train a binary model instead of a regression model, convert your target column to categorical before training.

████████
19:21:38.427: _response param, We have detected that your response column has only 2 unique values (0/1). If you wish to train a binary model instead of a regression model, convert your target column to categorical before training.

███████████
19:21:48.894: _response param, We have detected that your response column has only 2 unique values (0/1). If you wish to train a binary model instead of a regression model, convert your target column to categorical before training.

█████████████████
19:22:04.364: _response param, We have detected that your response column has only 2 unique values (0/1). If you wish to train a binary model instead of a regression model, convert your target column to categorical before training.

██
19:22:06.316: _response param, We have detected that your response column has only 2 unique values (0/1). If you wish to train a binary model instead of a regression model, convert your target column to categorical before training.

██
19:22:08.307: _response param, We have detected that your response column has only 2 unique values (0/1). If you wish to train a binary model instead of a regression model, convert your target column to categorical before training.

███
19:22:10.729: _response param, We have detected that your response column has only 2 unique values (0/1). If you wish to train a binary model instead of a regression model, convert your target column to categorical before training.

██
19:22:13.109: _response param, We have detected that your response column has only 2 unique values (0/1). If you wish to train a binary model instead of a regression model, convert your target column to categorical before training.

██████| (done) 100%
Out[191]:
Model Details
=============
H2OGeneralizedLinearEstimator : Generalized Linear Modeling
Model Key: GLM_1_AutoML_1_20230810_192117
GLM Model: summary
family link regularization lambda_search number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
gaussian identity Ridge ( lambda = 0.0117 ) nlambda = 30, lambda.max = 5.7259, lambda.min = 0.0117, lambda.1se = -1.0 15 15 14 AutoML_1_20230810_192117_training_Key_Frame__upload_932d3d6ace027d22c9061cf756964de2.hex
ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 0.14996128630919411
RMSE: 0.3872483522356088
MAE: 0.3002231894050112
RMSLE: 0.2717943242231826
Mean Residual Deviance: 0.14996128630919411
R^2: 0.04083176836665714
Null degrees of freedom: 22384
Residual degrees of freedom: 22369
Null deviance: 3499.785838731167
Residual deviance: 3356.88339403131
AIC: 21087.069151105054
ModelMetricsRegressionGLM: glm
** Reported on validation data. **

MSE: 0.1475144003497862
RMSE: 0.38407603459443573
MAE: 0.2980889742140967
RMSLE: 0.27059565475447794
Mean Residual Deviance: 0.1475144003497862
R^2: 0.03373079160541492
Null degrees of freedom: 3169
Residual degrees of freedom: 3154
Null deviance: 484.0569529248307
Residual deviance: 467.6206491088222
AIC: 2963.2308537435215
Scoring History:
timestamp duration iteration lambda predictors deviance_train deviance_test alpha iterations training_rmse training_deviance training_mae training_r2 validation_rmse validation_deviance validation_mae validation_r2
2023-08-10 19:21:30 0.000 sec 1 .57E1 16 0.1529860 0.1498662 0.0 None
2023-08-10 19:21:30 0.062 sec 2 .36E1 16 0.1522020 0.1492368 0.0 None
2023-08-10 19:21:30 0.130 sec 3 .22E1 16 0.1515042 0.1486899 0.0 None
2023-08-10 19:21:30 0.204 sec 4 .14E1 16 0.1509490 0.1482623 0.0 None
2023-08-10 19:21:30 0.243 sec 5 .85E0 16 0.1505483 0.1479579 0.0 5 0.3872484 0.1499613 0.3002232 0.0408318 0.3840760 0.1475144 0.2980890 0.0337308
2023-08-10 19:21:30 0.285 sec 6 .53E0 16 0.1502847 0.1477589 0.0 None
2023-08-10 19:21:30 0.319 sec 7 .33E0 16 0.1501273 0.1476415 0.0 None
2023-08-10 19:21:30 0.328 sec 8 .2E0 16 0.1500414 0.1475783 0.0 None
2023-08-10 19:21:30 0.364 sec 9 .13E0 16 0.1499981 0.1475466 0.0 None
2023-08-10 19:21:30 0.375 sec 10 .79E-1 16 0.1499774 0.1475310 0.0 None
2023-08-10 19:21:30 0.384 sec 11 .49E-1 16 0.1499680 0.1475228 0.0 None
2023-08-10 19:21:30 0.392 sec 12 .3E-1 16 0.1499638 0.1475182 0.0 None
2023-08-10 19:21:30 0.399 sec 13 .19E-1 16 0.1499620 0.1475157 0.0 None
2023-08-10 19:21:30 0.407 sec 14 .12E-1 16 0.1499613 0.1475144 0.0 None
2023-08-10 19:21:30 0.414 sec 15 .73E-2 16 0.1499609 0.1475129 0.0 None
Variable Importances:
variable relative_importance scaled_importance percentage
AP003_bin_WOE 0.0292007 1.0 0.1675982
TD009_bin_WOE 0.0257420 0.8815538 0.1477468
CR015_bin_WOE 0.0255326 0.8743841 0.1465452
TD014_bin_WOE 0.0173098 0.5927868 0.0993500
TD005_WOE 0.0148332 0.5079733 0.0851354
PA023_bin_WOE 0.0122045 0.4179545 0.0700484
AP008_WOE 0.0119189 0.4081737 0.0684092
TD001_bin_WOE 0.0106739 0.3655366 0.0612633
PA029_bin_WOE 0.0104481 0.3578045 0.0599674
PA022_bin_WOE 0.0048242 0.1652069 0.0276884
CR019_WOE 0.0043454 0.1488130 0.0249408
TD010_bin_WOE 0.0034180 0.1170529 0.0196178
AP001_WOE 0.0020142 0.0689766 0.0115603
CR009_bin_WOE 0.0010764 0.0368631 0.0061782
TD006_bin_WOE 0.0006883 0.0235725 0.0039507

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.

Leaderboard¶

Next, we will view the AutoML Leaderboard. Since we specified a leaderboard_frame in the H2OAutoML.train() method for scoring and ranking the models, the AutoML leaderboard uses the performance on this data to rank the models.

A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally and the leaderboard will be sorted by that metric. In the case of regression, the default ranking metric is mean residual deviance. In the future, the user will be able to specify any of the H2O metrics so that different metrics can be used to generate rankings on the leaderboard.

Now we will view a snapshot of the top models. Here we should see the GLM at the top of the leaderboard.

In [192]:
aml_v1.leaderboard.head()
Out[192]:
model_id rmse mse mae rmsle mean_residual_deviance
GLM_1_AutoML_1_20230810_192117 0.3840760.1475140.2980890.270596 0.147514
GBM_2_AutoML_1_20230810_192117 0.3855050.1486140.2975880.271512 0.148614
GBM_1_AutoML_1_20230810_192117 0.3861790.1491340.2989230.272413 0.149134
GBM_3_AutoML_1_20230810_192117 0.3877040.1503140.2997250.273626 0.150314
GBM_4_AutoML_1_20230810_192117 0.3886050.1510140.2976590.274501 0.151014
XGBoost_3_AutoML_1_20230810_1921170.3893880.1516230.2988360.274869 0.151623
DRF_1_AutoML_1_20230810_192117 0.3996370.15971 0.3085630.285488 0.15971
XRT_1_AutoML_1_20230810_192117 0.4021020.1616860.30966 0.288018 0.161686
XGBoost_2_AutoML_1_20230810_1921170.4189070.1754830.3127660.301935 0.175483
XGBoost_1_AutoML_1_20230810_1921170.4286060.1837030.3189930.311319 0.183703
[10 rows x 6 columns]

The ranking displays the performance of various machine learning models based on different evaluation metrics. Models are compared using metrics such as Root Mean Squared Error (RMSE), Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Logarithmic Error (RMSLE), and Mean Residual Deviance. Lower values for these metrics indicate better model performance. In this case, the top-performing model is "GLM_1," with the lowest RMSE, MSE, MAE, RMSLE, and mean residual deviance values, showcasing its accurate predictive ability. The subsequent models, such as "GBM" and "XGBoost," exhibit slightly higher error values. While the "DRF" and "XRT" models have larger errors, the "XGBoost_1" model ranks last with the highest error metrics. Overall, the ranking helps in selecting the best-performing model based on these evaluation criteria.

Predict Using Leader Model¶

If you need to generate predictions on a test set, you can make predictions on the "H2OAutoML" object directly, or on the leader model object.

In [193]:
pred = aml_v1.predict(test_hex)
pred.head()
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[193]:
predict
0.338041
0.29501
0.252813
0.172356
0.228082
0.253299
0.193611
0.0664591
0.101311
0.279401
[10 rows x 1 column]

If needed, the standard model_performance() method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.¶

In [194]:
perf = aml_v1.leader.model_performance(test_hex)
perf
Out[194]:
ModelMetricsRegressionGLM: glm
** Reported on test data. **

MSE: 0.15068694895577303
RMSE: 0.3881841688628904
MAE: 0.3000688362044224
RMSLE: 0.27298560774821146
Mean Residual Deviance: 0.15068694895577303
R^2: 0.016910672983428743
Null degrees of freedom: 7999
Residual degrees of freedom: 7984
Null deviance: 1226.4295416639784
Residual deviance: 1205.4955916461843
AIC: 7596.610291955
In [205]:
# Get the best model from AutoML
best_model = aml_v1.leader
best_model
Out[205]:
Model Details
=============
H2OGeneralizedLinearEstimator : Generalized Linear Modeling
Model Key: GLM_1_AutoML_1_20230810_192117
GLM Model: summary
family link regularization lambda_search number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
gaussian identity Ridge ( lambda = 0.0117 ) nlambda = 30, lambda.max = 5.7259, lambda.min = 0.0117, lambda.1se = -1.0 15 15 14 AutoML_1_20230810_192117_training_Key_Frame__upload_932d3d6ace027d22c9061cf756964de2.hex
ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 0.14996128630919411
RMSE: 0.3872483522356088
MAE: 0.3002231894050112
RMSLE: 0.2717943242231826
Mean Residual Deviance: 0.14996128630919411
R^2: 0.04083176836665714
Null degrees of freedom: 22384
Residual degrees of freedom: 22369
Null deviance: 3499.785838731167
Residual deviance: 3356.88339403131
AIC: 21087.069151105054
ModelMetricsRegressionGLM: glm
** Reported on validation data. **

MSE: 0.1475144003497862
RMSE: 0.38407603459443573
MAE: 0.2980889742140967
RMSLE: 0.27059565475447794
Mean Residual Deviance: 0.1475144003497862
R^2: 0.03373079160541492
Null degrees of freedom: 3169
Residual degrees of freedom: 3154
Null deviance: 484.0569529248307
Residual deviance: 467.6206491088222
AIC: 2963.2308537435215
Scoring History:
timestamp duration iteration lambda predictors deviance_train deviance_test alpha iterations training_rmse training_deviance training_mae training_r2 validation_rmse validation_deviance validation_mae validation_r2
2023-08-10 19:21:30 0.000 sec 1 .57E1 16 0.1529860 0.1498662 0.0 None
2023-08-10 19:21:30 0.062 sec 2 .36E1 16 0.1522020 0.1492368 0.0 None
2023-08-10 19:21:30 0.130 sec 3 .22E1 16 0.1515042 0.1486899 0.0 None
2023-08-10 19:21:30 0.204 sec 4 .14E1 16 0.1509490 0.1482623 0.0 None
2023-08-10 19:21:30 0.243 sec 5 .85E0 16 0.1505483 0.1479579 0.0 5 0.3872484 0.1499613 0.3002232 0.0408318 0.3840760 0.1475144 0.2980890 0.0337308
2023-08-10 19:21:30 0.285 sec 6 .53E0 16 0.1502847 0.1477589 0.0 None
2023-08-10 19:21:30 0.319 sec 7 .33E0 16 0.1501273 0.1476415 0.0 None
2023-08-10 19:21:30 0.328 sec 8 .2E0 16 0.1500414 0.1475783 0.0 None
2023-08-10 19:21:30 0.364 sec 9 .13E0 16 0.1499981 0.1475466 0.0 None
2023-08-10 19:21:30 0.375 sec 10 .79E-1 16 0.1499774 0.1475310 0.0 None
2023-08-10 19:21:30 0.384 sec 11 .49E-1 16 0.1499680 0.1475228 0.0 None
2023-08-10 19:21:30 0.392 sec 12 .3E-1 16 0.1499638 0.1475182 0.0 None
2023-08-10 19:21:30 0.399 sec 13 .19E-1 16 0.1499620 0.1475157 0.0 None
2023-08-10 19:21:30 0.407 sec 14 .12E-1 16 0.1499613 0.1475144 0.0 None
2023-08-10 19:21:30 0.414 sec 15 .73E-2 16 0.1499609 0.1475129 0.0 None
Variable Importances:
variable relative_importance scaled_importance percentage
AP003_bin_WOE 0.0292007 1.0 0.1675982
TD009_bin_WOE 0.0257420 0.8815538 0.1477468
CR015_bin_WOE 0.0255326 0.8743841 0.1465452
TD014_bin_WOE 0.0173098 0.5927868 0.0993500
TD005_WOE 0.0148332 0.5079733 0.0851354
PA023_bin_WOE 0.0122045 0.4179545 0.0700484
AP008_WOE 0.0119189 0.4081737 0.0684092
TD001_bin_WOE 0.0106739 0.3655366 0.0612633
PA029_bin_WOE 0.0104481 0.3578045 0.0599674
PA022_bin_WOE 0.0048242 0.1652069 0.0276884
CR019_WOE 0.0043454 0.1488130 0.0249408
TD010_bin_WOE 0.0034180 0.1170529 0.0196178
AP001_WOE 0.0020142 0.0689766 0.0115603
CR009_bin_WOE 0.0010764 0.0368631 0.0061782
TD006_bin_WOE 0.0006883 0.0235725 0.0039507

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [206]:
def createGains(model):
    predictions = model.predict(test_hex)
    test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()

    #sort on prediction (descending), add id, and decile for groups containing 1/10 of datapoints
    test_scores = test_scores.sort_values(by='predict',ascending=False)
    test_scores['row_id'] = range(0,0+len(test_scores))
    test_scores['decile'] = ( test_scores['row_id'] / (len(test_scores)/10) ).astype(int)
    #see count by decile
    test_scores.loc[test_scores['decile'] == 10]=9
    test_scores['decile'].value_counts()

    #create gains table
    gains = test_scores.groupby('decile')['loan_default'].agg(['count','sum'])
    gains.columns = ['count','actual']
    gains

    #add features to gains table
    gains['non_actual'] = gains['count'] - gains['actual']
    gains['cum_count'] = gains['count'].cumsum()
    gains['cum_actual'] = gains['actual'].cumsum()
    gains['cum_non_actual'] = gains['non_actual'].cumsum()
    gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
    gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
    gains['if_random'] = np.max(gains['cum_actual']) /10
    gains['if_random'] = gains['if_random'].cumsum()
    gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
    gains['K_S'] = np.abs( gains['percent_cum_actual'] -  gains['percent_cum_non_actual'] ) * 100
    gains['gain']=(gains['cum_actual']/gains['cum_count']*100).round(2)
    gains = pd.DataFrame(gains)
    return(gains)

createGains(best_model)
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Out[206]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 800 230 570 800 230 570 0.15 0.09 151.2 1.52 6.0 28.75
1 800 197 603 1600 427 1173 0.28 0.18 302.4 1.41 10.0 26.69
2 800 198 602 2400 625 1775 0.41 0.27 453.6 1.38 14.0 26.04
3 800 189 611 3200 814 2386 0.54 0.37 604.8 1.35 17.0 25.44
4 800 147 653 4000 961 3039 0.64 0.47 756.0 1.27 17.0 24.02
5 800 112 688 4800 1073 3727 0.71 0.57 907.2 1.18 14.0 22.35
6 800 109 691 5600 1182 4418 0.78 0.68 1058.4 1.12 10.0 21.11
7 800 124 676 6400 1306 5094 0.86 0.79 1209.6 1.08 7.0 20.41
8 800 107 693 7200 1413 5787 0.93 0.89 1360.8 1.04 4.0 19.62
9 800 99 701 8000 1512 6488 1.00 1.00 1512.0 1.00 0.0 18.90
In [208]:
def ROC_AUC(my_result,df,target):
    from sklearn.metrics import roc_curve,auc
    from sklearn.metrics import average_precision_score
    from sklearn.metrics import precision_recall_curve
    import matplotlib.pyplot as plt

    # ROC
    y_actual = df[target].as_data_frame()
    y_pred = my_result.predict(df).as_data_frame()
    fpr = list()
    tpr = list()
    roc_auc = list()
    fpr,tpr,_ = roc_curve(y_actual,y_pred)
    roc_auc = auc(fpr,tpr)

    # Precision-Recall
    average_precision = average_precision_score(y_actual,y_pred)

    print('')
    print('   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate')
    print('')
    print('	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy')
    print('')
    print('   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)')
    print('')

    # plotting
    plt.figure(figsize=(10,4))

    # ROC
    plt.subplot(1,2,1)
    plt.plot(fpr,tpr,color='darkorange',lw=2,label='ROC curve (aare=%0.2f)' % roc_auc)
    plt.plot([0,1],[0,1],color='navy',lw=3,linestyle='--')
    plt.xlim([0.0,1.0])
    plt.ylim([0.0,1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic: AUC={0:0.4f}'.format(roc_auc))
    plt.legend(loc='lower right')


    # Precision-Recall
    plt.subplot(1,2,2)
    precision,recall,_ = precision_recall_curve(y_actual,y_pred)
    plt.step(recall,precision,color='b',alpha=0.2,where='post')
    plt.fill_between(recall,precision,step='post',alpha=0.2,color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0,1.05])
    plt.xlim([0.0,1.0])
    plt.title('Precision-Recall curve: PR={0:0.4f}'.format(average_precision))
    plt.show()

ROC_AUC(best_model,test_hex,'loan_default')
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%

   * ROC curve: The ROC curve plots the true positive rate vs. the false rositive sate

	  * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

Insights:¶

  • After employing AutoML on my dataset, the achieved AUC is 0.6019. The best model identified by AutoML is the GLM with a Lambda value of 0.0117. Although this AUC is slightly lower than that of the GLM with Lambda 0.001, it's important to note that AutoML's model selection process rigorously compared the performance of various machine learning techniques. Consequently, while the specific GLM variant yielded a slightly diminished AUC, the broader AutoML process ensured a comprehensive evaluation, ultimately leading to a well-informed model selection decision.

Section 9 SHAP ¶

'shap' is a Python package that's used for explaining the output of machine learning models. It provides tools for understanding the importance of input features in making predictions.¶

SHAP (SHapley Additive exPlanations) is a technique in explainable AI that quantifies the contribution of each feature to a model's predictions. It calculates values representing how much each feature influences predictions, considering interactions. SHAP values enable clear feature importance ranking and help interpret complex models. They enhance model transparency, aiding users in understanding decision-making processes.

In [162]:
#Concatenate along rows (vertically)
data_shap = pd.concat([train_df_rf, test_df_rf])
data_shap = data_shap.sort_values(by='id', ascending=True)
data_shap
Out[162]:
id loan_default AP001_WOE AP003_bin_WOE AP008_WOE CR009_bin_WOE CR015_bin_WOE CR019_WOE PA022_bin_WOE PA023_bin_WOE PA029_bin_WOE TD001_bin_WOE TD005_WOE TD006_bin_WOE TD009_bin_WOE TD010_bin_WOE TD014_bin_WOE
15109 1 1 0.01 0.07 0.02 0.07 0.19 0.14 -0.15 -0.13 -0.14 -0.24 0.04 -0.14 0.04 -0.24 -0.08
24229 2 0 0.10 0.07 0.09 0.08 -0.27 -0.20 -0.15 -0.13 -0.14 0.02 -0.03 -0.14 -0.18 -0.24 -0.08
56026 3 0 -0.04 -0.50 -0.09 0.07 0.19 0.12 -0.15 -0.13 -0.14 0.02 0.04 -0.14 0.04 -0.24 -0.30
22834 4 0 -0.03 -0.50 0.11 -0.09 0.08 -0.05 -0.15 -0.13 -0.14 -0.24 -0.44 -0.14 -0.49 -0.24 -0.30
2642 5 0 -0.04 0.07 0.09 -0.09 -0.27 -0.20 -0.15 -0.13 -0.14 0.02 -0.22 -0.14 -0.49 -0.24 -0.30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
51386 79996 0 -0.10 0.07 0.02 0.07 0.08 -0.09 -0.15 -0.13 -0.14 0.02 -0.22 -0.14 -0.18 -0.24 0.14
17903 79997 0 0.01 -0.50 0.09 0.08 0.08 0.02 -0.15 -0.13 -0.14 -0.24 -0.22 -0.14 -0.49 -0.24 -0.30
16471 79998 0 -0.14 0.07 0.02 -0.09 0.19 -0.09 -0.15 -0.13 -0.14 -0.24 -0.51 0.11 -0.49 0.00 -0.08
36131 79999 0 -0.05 0.07 -0.09 0.07 0.19 0.02 -0.15 -0.13 -0.14 -0.24 -0.44 -0.14 -0.49 -0.24 -0.30
42494 80000 1 0.04 0.07 0.02 0.07 0.08 0.02 0.22 0.26 0.07 0.39 0.41 0.40 0.17 0.45 0.48

80000 rows × 17 columns

In [163]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor

predictors = data_shap.columns.tolist()
predictors=predictors[2:17]
predictors

Y = data_shap['loan_default']
X = data_shap[predictors]

#Train-test split on the features (X) and target (Y) data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
In [164]:
#max_depth=6: This sets the maximum depth of each tree in the forest to 6. It limits how deep each individual tree can grow, helping to control overfitting.the code.
#n_estimators=10: This specifies the number of decision trees (estimators) to create in the random forest ensemble.
model = RandomForestRegressor(max_depth=6, random_state=0, n_estimators=10)
model.fit(X_train, Y_train)  
print(model.feature_importances_)
#which features (input variables) were most influential in making predictions
[0.03332825 0.15531092 0.04499582 0.01275476 0.02906626 0.03585508
 0.02209713 0.03377747 0.07034303 0.01827997 0.11864086 0.00602769
 0.31841822 0.02206263 0.07904192]
In [165]:
importances = model.feature_importances_
indices = np.argsort(importances)

features = X_train.columns
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Keep the top 10 variables (features)¶

In [166]:
predictors = data_shap.columns.tolist()
predictors= ['AP001_WOE',
 'AP003_bin_WOE',
 'AP008_WOE',
 'CR015_bin_WOE',
 'PA022_bin_WOE',
 'PA023_bin_WOE',
 'PA029_bin_WOE',
 'TD005_WOE',
 'TD009_bin_WOE',
 'TD014_bin_WOE']

Y = data_shap['loan_default']
X = data_shap[predictors]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

Calculate the SHAP (SHapley Additive exPlanations) values using a TreeExplainer. The TreeExplainer is designed to work with tree-based models like random forests. The calculated shap_values will provide an explanation for each prediction made by the model, showing how much each feature contributed to that prediction.¶

Second part of the code is calculating the correlation between the SHAP values for each feature and the corresponding actual values of that feature in the training data.¶

In [167]:
#ARCHFLAGS="-arch x86_64" 
#!pip3 install shap
!pip install git+https://github.com/slundberg/shap.git
import shap

#'check_additivity=False' disables the additivity check for faster computation
shap_values = shap.TreeExplainer(model).shap_values(X_train, check_additivity=False)

# Determine the correlation in order to plot with different colors
corrlist = np.zeros(len(predictors))
X_train_np = X_train.to_numpy() # our X_train is a pandas data frame. Convert it to numpy
for i in range(0,len(predictors) ):
    tmp = np.corrcoef(shap_values[:,i],X_train_np[:,i])
    corrlist[i] = tmp[0][1]
Collecting git+https://github.com/slundberg/shap.git
  Cloning https://github.com/slundberg/shap.git to /private/var/folders/jl/pdyb2sq53l1_msbfhzzlrt6m0000gn/T/pip-req-build-pqzk4pft
  Running command git clone --filter=blob:none --quiet https://github.com/slundberg/shap.git /private/var/folders/jl/pdyb2sq53l1_msbfhzzlrt6m0000gn/T/pip-req-build-pqzk4pft
  Resolved https://github.com/slundberg/shap.git to commit ec17a2604127c16b83caaf8e3b4d10eeadaa73ee
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: numpy in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from shap==0.42.1) (1.24.1)
Requirement already satisfied: scipy in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from shap==0.42.1) (1.10.1)
Requirement already satisfied: scikit-learn in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from shap==0.42.1) (1.2.2)
Requirement already satisfied: pandas in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from shap==0.42.1) (1.5.3)
Requirement already satisfied: tqdm>=4.27.0 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from shap==0.42.1) (4.65.0)
Requirement already satisfied: packaging>20.9 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from shap==0.42.1) (23.0)
Requirement already satisfied: slicer==0.0.7 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from shap==0.42.1) (0.0.7)
Requirement already satisfied: numba in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from shap==0.42.1) (0.57.1)
Requirement already satisfied: cloudpickle in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from shap==0.42.1) (2.2.1)
Requirement already satisfied: llvmlite<0.41,>=0.40.0dev0 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from numba->shap==0.42.1) (0.40.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from pandas->shap==0.42.1) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from pandas->shap==0.42.1) (2022.7.1)
Requirement already satisfied: joblib>=1.1.1 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from scikit-learn->shap==0.42.1) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from scikit-learn->shap==0.42.1) (3.1.0)
Requirement already satisfied: six>=1.5 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from python-dateutil>=2.8.1->pandas->shap==0.42.1) (1.16.0)

[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: pip install --upgrade pip
In [168]:
corrlist
# The correlation coefficient measures the strength and direction of the linear relationship between two variables. 
# In this context, it helps understand how the SHAP values are related to the actual feature values. After this loop completes, corrlist will contain the correlation coefficients for each feature, indicating how much the SHAP values and the actual feature values align.
Out[168]:
array([0.23669973, 0.68691416, 0.78960678, 0.05166419, 0.94047705,
       0.816006  , 0.96310438, 0.72578309, 0.80402505, 0.69202017])

'shap_v_abs' is the absolute value of the SHAP values. This is important because SHAP values can be positive (indicating a feature's positive impact on the prediction) or negative (indicating a feature's negative impact on the prediction). Taking the absolute value helps measure the overall impact of each feature without considering the direction.¶

'shap_v_abs_mean' calculates the mean of the absolute SHAP values for each feature along axis=0. This gives you an idea of the average contribution of each feature across all the instances in the training data.¶

In [169]:
# Calculate the absolute SHAP values
shap_v_abs = np.abs(shap_values)
shap_v_abs_mean = shap_v_abs.mean(axis=0)
In [170]:
shap_v_abs_mean
Out[170]:
array([0.00128068, 0.03765104, 0.00477671, 0.00717649, 0.00687697,
       0.00435646, 0.00428661, 0.00684885, 0.01171429, 0.00281381])
In [171]:
k = pd.DataFrame({'Variables': predictors, 'abs_SHAP': shap_v_abs_mean}).reset_index()
k
Out[171]:
index Variables abs_SHAP
0 0 AP001_WOE 0.001281
1 1 AP003_bin_WOE 0.037651
2 2 AP008_WOE 0.004777
3 3 CR015_bin_WOE 0.007176
4 4 PA022_bin_WOE 0.006877
5 5 PA023_bin_WOE 0.004356
6 6 PA029_bin_WOE 0.004287
7 7 TD005_WOE 0.006849
8 8 TD009_bin_WOE 0.011714
9 9 TD014_bin_WOE 0.002814
In [172]:
shap.summary_plot(shap_values, X_train, plot_type="bar")

Can the above variable importance plot show the directions between the features and the target variable? Yes, that's the power of the Shap value plot as shown below. This plot is made of many dots. Each dot has three characteristics. The graph below plots the SHAP values of every feature for every sample. It shorts features by the total of absolute SHAP values over all samples. The color represents the feature value (red high, blue low).

  • The vertical location shows the feature importance.
  • The horizontal location shows whether the effect of that value caused a higher or lower prediction.
  • Color shows whether that feature was high or low for that observation
In [173]:
def ABS_SHAP(df_shap,df):
    #import matplotlib as plt
    # Make a copy of the input data
    shap_v = pd.DataFrame(df_shap)
    feature_list = df.columns
    shap_v.columns = feature_list
    df_v = df.copy().reset_index().drop('index',axis=1)
    
    # Determine the correlation in order to plot with different colors
    corr_list = list()
    for i in feature_list:
        b = np.corrcoef(shap_v[i],df_v[i])[1][0]
        corr_list.append(b)
    corr_df = pd.concat([pd.Series(feature_list),pd.Series(corr_list)],axis=1).fillna(0)
    # Make a data frame. Column 1 is the feature, and Column 2 is the correlation coefficient
    corr_df.columns  = ['Variable','Corr']
    corr_df['Sign'] = np.where(corr_df['Corr']>0,'red','blue')
    
    # Plot it
    shap_abs = np.abs(shap_v)
    k=pd.DataFrame(shap_abs.mean()).reset_index()
    k.columns = ['Variable','SHAP_abs']
    k2 = k.merge(corr_df,left_on = 'Variable',right_on='Variable',how='inner')
    k2 = k2.sort_values(by='SHAP_abs',ascending = True)
    colorlist = k2['Sign']
    ax = k2.plot.barh(x='Variable',y='SHAP_abs',color = colorlist, figsize=(6,4),legend=False)
    ax.set_xlabel("SHAP Value (Red = Positive Impact)")
    
ABS_SHAP(shap_values,X_train)  

The summary_plot¶

In [174]:
shap.summary_plot(shap_values, X_train)
#generates a summary plot using the SHAP values and the training data

We can describe the model. A high probability of loan default is associated with the following characteristics:¶

  • High AP003_bin_WOE
  • Low PA022_bin_WOE
  • High TD009_bin_WOE
  • High TD005_WOE
  • High PA029_bin_WOE
  • High AP008_WOE
  • High PA023_bin_WOE
  • High CR015_bin_WOE
  • High TD014_bin_WOE
  • Low AP001_WOE

To understand how a single feature effects the output of the model we can plot the SHAP value of that feature vs. the value of the feature for all the examples in a dataset. Vertical dispersion at a single value represents interaction effects with other features. To help reveal these interactions dependence_plot automatically selects another feature for coloring.

The dependence_plot¶

A SHAP dependence plot helps you understand how changes in the chosen feature's value influence the model's predictions. It's particularly useful for visualizing non-linear relationships and interactions between the feature and the model's predictions.¶

The resulting dependence plot will typically have two main components:¶

  • Feature Values on x-Axis: The x-axis of the plot represents the feature's values from the X_train data. This gives you an idea of the range of values for the chosen feature.
  • SHAP Values on y-Axis: The y-axis shows the corresponding SHAP values for each data point. This indicates the impact of the feature's value on each prediction.
In [175]:
shap.dependence_plot("TD009_bin_WOE", shap_values, X_train)
# In this case AP003_bin_WOE highlights that it has more impact on loan default than TD009_bin_WOE.
In [176]:
shap_interaction_values = shap.TreeExplainer(model).shap_interaction_values(X_train.iloc[:2000,:])
shap.summary_plot(shap_interaction_values, X_train.iloc[:2000,:])

The force_plot¶

Visualize the given SHAP values with an additive force layout.¶

In [185]:
shap.initjs()
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train, check_additivity=False)

shap.force_plot(explainer.expected_value, shap_values[0])
Out[185]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
  1. The base value is the average of all output values of the model on the training.
  2. The pink (red) color features are many with small (low importance) values. The plot stacked them all together and shows their values on hover. The values represent how much those features influence the final output of the model.
  3. higher/lower is a caption. It indicates if each feature value influences the prediction to a higher or lower output value.

Descriptions for the top 5 variables¶

These descriptions help clarify how each variable's values impact the likelihood of loan default. The relationships provide insights into which factors contribute positively or negatively to the prediction of loan defaults.¶

  1. AP003_bin_WOE (AP003 - CODE_EDUCATION):

    • This variable represents the education level of applicants.
    • It has a positive relationship with the target variable 'loan default'. Assuming the higher the values are, the lower the education levels they are with,this means that applicants with higher education levels are less likely to default on their loans.
  2. TD009_bin_WOE (TD009 - TD_CNT_QUERY_LAST_3MON_P2P):

    • This variable indicates the count of queries an applicant made to peer-to-peer (P2P) lending platforms in the last 3 months.
    • It has a positive relationship with the target variable 'loan default'. This implies that applicants who frequently query P2P lending platforms in the recent past are more likely to default on their loans.
  3. PA022_bin_WOE (PA022 - DAYS_BTW_APPLICATION_AND_FIRST_COLLECTION_OR_HIGH_RISK_CALL):

    • This variable measures the time (in days) between the applicant's loan application and the first collection or high-risk call.
    • It has a negative relationship with the target variable 'loan default'. This suggests that applicants who experience a shorter time gap between application and the first collection or high-risk call are less likely to default on their loans.
  4. CR015_bin_WOE (CR015 - MONTH_CREDIT_CARD_MOB_MAX):

    • Assume this variable represents the maximum month of credit card usage for an applicant.
    • It has a negative relationship with the target variable 'loan default'. This indicates that applicants who have a lower maximum month of credit card usage are less likely to default on their loans.
  5. TD005_WOE (TD005 - TD_CNT_QUERY_LAST_1MON_P2P):

    • This variable indicates the count of queries an applicant made to peer-to-peer (P2P) lending platforms in the last 1 month.
    • It has a positive relationship with the target variable 'loan default'. This means that applicants who recently made more queries to P2P lending platforms are more likely to default on their loans.